Preface to the Third Edition


For some forty years the first and second editions of this book have been 
used by students to acquire a basic knowledge of the theory and methods of 
multivariate statistical analysis. The book has also served a wider community 
of statisticians in furthering their understanding and proficiency in this field. 
Since the second edition was published, multivariate analysis has been 
developed and extended in many directions. Rather than attempting to cover, 
or even survey, the enlarged scope, I have elected to elucidate several aspects 
that are particularly interesting and useful for methodology and comprehension.

Earlier editions included some methods that could be carried out on an 
adding machine! In the twenty-first century, however, computational techniques
have become so highly developed and improvements come so rapidly
that it is impossible to include all of the relevant methods in a volume on the 
general mathematical theory. Some aspects of statistics exploit computational 
power such as the resampling technologies; these are not covered here. 

The definition of multivariate statistics implies the treatment of variables 
that are interrelated. Several chapters are devoted to measures of correlation 
and tests of independence. A new chapter, “Patterns of Dependence; Graphical
Models,” has been added. A so-called graphical model is a set of vertices
or nodes identifying observed variables together with a new set of edges 
suggesting dependences between variables. The algebra of such graphs is an 
outgrowth and development of path analysis and the study of causal chains. 
A graph may represent a sequence in time or logic and may suggest causation 
of one set of variables by another set. 

Another new topic systematically presented in the third edition is that of 
elliptically contoured distributions. The multivariate normal distribution, 
which is characterized by the mean vector and covariance matrix，has a 
limitation that the fourth-order moments of the variables are determined by 
the first- and second-order moments. The class of elliptically contoured
distributions relaxes this restriction. A density in this class has contours of
equal density which are ellipsoids as does a normal density, but the set of 
fourth-order moments has one further degree of freedom. This topic is 
expounded by the addition of sections to appropriate chapters. 

Reduced rank regression developed in Chapters 12 and 13 provides a 
method of reducing the number of regression coefficients to be estimated in 
the regression of one set of variables to another. This approach includes the 
limited-information maximum-likelihood estimator of an equation in a simultaneous
equations model.

The preparation of the third edition has benefited from the advice and
comments of readers of the first and second editions as well as of reviewers
of the current revision. In addition to readers of the earlier editions listed in
those prefaces, I want to thank Michael Perlman and Kathy Richards for their
assistance in getting this manuscript ready. 

T. W. Anderson 

Stanford, California 
February 2003 



Preface to the Second Edition 


Twenty-six years have passed since the first edition of this book was published.
During that time great advances have been made in multivariate
statistical analysis — particularly in the areas treated in that volume. This new 
edition purports to bring the original edition up to date by substantial 
revision, rewriting, and additions. The basic approach has been maintained, 
namely, a mathematically rigorous development of statistical methods for 
observations consisting of several measurements or characteristics of each 
subject and a study of their properties. The general outline of topics has been 
retained. 

The method of maximum likelihood has been augmented by other considerations.
In point estimation of the mean vector and covariance matrix,
alternatives to the maximum likelihood estimators that are better with 
respect to certain loss functions, such as Stein and Bayes estimators, have 
been introduced. In testing hypotheses likelihood ratio tests have been 
supplemented by other invariant procedures. New results on distributions 
and asymptotic distributions are given; some significant points are tabulated. 
Properties of these procedures, such as power functions, admissibility, unbiasedness,
and monotonicity of power functions, are studied. Simultaneous
confidence intervals for means and covariances are developed. A chapter on 
factor analysis replaces the chapter sketching miscellaneous results in the 
first edition. Some new topics, including simultaneous equations models and 
linear functional relationships, are introduced. Additional problems present
further results. 

It is impossible to cover all relevant material in this book; what seems 
most important has been included. For a comprehensive listing of papers 
until 1966 and books until 1970 the reader is referred to A Bibliography of 
Multivariate Statistical Analysis by Anderson, Das Gupta, and Styan (1972). 
Further references can be found in Multivariate Analysis: A Selected and
Abstracted Bibliography, 1957-1972 by Subrahmaniam and Subrahmaniam
(1973). 

I am in debt to many students, colleagues, and friends for their suggestions 
and assistance; they include Yasuo Amemiya, James Berger, Byoung-Seon 
Choi, Arthur Cohen, Margery Cruise, Somesh Das Gupta, Kai-Tai Fang,
Gene Golub, Aaron Han, Takeshi Hayakawa, Jogi Henna, Huang Hsu, Fred
Huffer, Mituaki Huzii, Jack Kiefer, Mark Knowles, Sue Leurgans, Alex
McMillan, Masashi No, Ingram Olkin, Kartik Patel, Michael Perlman, Allen
Sampson, Ashis Sen Gupta, Andrew Siegel, Charles Stein, Patrick Strout,
Akimichi Takemura, Joe Verducci, Marlos Viana, and Y. Yajima. I was
helped in preparing the manuscript by Dorothy Anderson, Alice Lundin,
Amy Schwartz, and Pat Struse. Special thanks go to Johanne Thiffault and
George P. H. Styan for their precise attention. Support was contributed by
the Army Research Office, the National Science Foundation, the Office of 
Naval Research, and IBM Systems Research Institute. 

Seven tables of significance points are given in Appendix B to facilitate 
carrying out test procedures. Tables 1, 5, and 7 are Tables 47, 50, and 53,
respectively, of Biometrika Tables for Statisticians, Vol. 2, by E. S. Pearson
and H. O. Hartley; permission of the Biometrika Trustees is hereby acknowledged.
Table 2 is made up from three tables prepared by A. W. Davis and
published in Biometrika (1970a), Annals of the Institute of Statistical Mathematics
(1970b), and Communications in Statistics, B. Simulation and Computation
(1980). Tables 3 and 4 are Tables 6.3 and 6.4, respectively, of Concise
Statistical Tables, edited by Ziro Yamauti (1977) and published by the
Japanese Standards Association; this book is a concise version of Statistical
Tables and Formulas with Computer Applications, JSA-1972. Table 6 is Table 3
of The Distribution of the Sphericity Test Criterion, ARL 72-0154, by B. N.
Nagarsenker and K. C. S. Pillai, Aerospace Research Laboratories (1972).
The author is indebted to the authors and publishers listed above for 
permission to reproduce these tables. 


Stanford, California
June 1984 


T. W. Anderson 



Preface to the First Edition 


This book has been designed primarily as a text for a two-semester course in 
multivariate statistics. It is hoped that the book will also serve as an 
introduction to many topics in this area to statisticians who are not students 
and will be used as a reference by other statisticians. 

For several years the book in the form of dittoed notes has been used in a 
two-semester sequence of graduate courses at Columbia University; the first
six chapters constituted the text for the first semester, emphasizing correlation
theory. It is assumed that the reader is familiar with the usual theory of
univariate statistics, particularly methods based on the univariate normal 
distribution. A knowledge of matrix algebra is also a prerequisite; however, 
an appendix on this topic has been included. 

It is hoped that the more basic and important topics are treated here, 
though to some extent the coverage is a matter of taste. Some of the more 
recent and advanced developments are only briefly touched on in the last
chapter.

The method of maximum likelihood is used to a large extent. This leads to 
reasonable procedures; in some cases it can be proved that they are optimal. 
In many situations, however, the theory of desirable or optimum procedures 
is lacking. 

Over the years that this manuscript has been developed, a number of students
and colleagues have been of considerable assistance. Allan Birnbaum, Harold
Hotelling, Jacob Horowitz, Howard Levene, Ingram Olkin, Gobind Seth,
Charles Stein, and Henry Teicher are to be mentioned particularly.
Acknowledgements are also due to other members of the Graduate Mathematical
Statistics Society at Columbia University for aid in the preparation of the
manuscript in dittoed form. The preparation of this manuscript was supported
in part by the Office of Naval Research.


T. W. Anderson 


Center for Advanced Study 
in the Behavioral Sciences
Stanford, California 
December 1957 



CHAPTER 1 


Introduction 


1.1. MULTIVARIATE STATISTICAL ANALYSIS

Multivariate statistical analysis is concerned with data that consist of sets of 
measurements on a number of individuals or objects. The sample data may 
be heights and weights of some individuals drawn randomly from a population
of school children in a given city, or the statistical treatment may be
made on a collection of measurements, such as lengths and widths of petals 
and lengths and widths of sepals of iris plants taken from two species, or one 
may study the scores on batteries of mental tests administered to a number of 
students. 

The measurements made on a single individual can be assembled into a 
column vector. We think of the entire vector as an observation from a
multivariate population or distribution. When the individual is drawn randomly,
we consider the vector as a random vector with a distribution or
probability law describing that population. The set of observations on all
individuals in a sample constitutes a sample of vectors, and the vectors set
side by side make up the matrix of observations.† The data to be analyzed
then are thought of as displayed in a matrix or in several matrices. 

We shall see that it is helpful in visualizing the data and understanding the 
methods to think of each observation vector as constituting a point in a 
Euclidean space, each coordinate corresponding to a measurement or variable.
Indeed, an early step in the statistical analysis is plotting the data; since
most statisticians are limited to two-dimensional plots, two coordinates of the
observation are plotted in turn.

† When data are listed on paper by individual, it is natural to print the measurements on one
individual as a row of the table; then one individual corresponds to a row vector. Since we prefer
to operate algebraically with column vectors, we have chosen to treat observations in terms of
column vectors. (In practice, the basic data set may well be on cards, tapes, or disks.)

Characteristics of a univariate distribution of essential interest are the 
mean as a measure of location and the standard deviation as a measure of 
variability; similarly the mean and standard deviation of a univariate sample 
are important summary measures. In multivariate analysis, the means and
variances of the separate measurements, for distributions and for samples,
have corresponding relevance. An essential aspect, however, of multivariate
analysis is the dependence between the different variables. The dependence
between two variables may involve the covariance between them, that
is, the average product of their deviations from their respective means. The
covariance standardized by the corresponding standard deviations is the
correlation coefficient; it serves as a measure of degree of dependence. A set
of summary statistics is the mean vector (consisting of the univariate means)
and the covariance matrix (consisting of the univariate variances and bivariate
covariances). An alternative set of summary statistics with the same
information is the mean vector, the set of standard deviations, and the
correlation matrix. Similar parameter quantities describe location, variability,
and dependence in the population or for a probability distribution. The
multivariate normal distribution is completely determined by its mean vector
and covariance matrix, and the sample mean vector and covariance matrix
constitute a sufficient set of statistics.
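
The two sets of summary statistics just described can be illustrated with a small computation. The sketch below is only an illustration under stated assumptions: the data are hypothetical heights and weights, the divisor n - 1 is an assumed convention for the sample covariance, and the function names are ours.

```python
from math import sqrt

def mean_vector(rows):
    """Componentwise means of a list of observation vectors."""
    n = len(rows)
    p = len(rows[0])
    return [sum(row[j] for row in rows) / n for j in range(p)]

def covariance_matrix(rows):
    """Sample covariance matrix; the divisor n - 1 is an assumed convention."""
    n = len(rows)
    m = mean_vector(rows)
    p = len(m)
    return [[sum((row[i] - m[i]) * (row[j] - m[j]) for row in rows) / (n - 1)
             for j in range(p)] for i in range(p)]

def correlation_matrix(cov):
    """Standardize each covariance by the two corresponding standard deviations."""
    sd = [sqrt(cov[i][i]) for i in range(len(cov))]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(len(cov))]
            for i in range(len(cov))]

# Hypothetical heights (cm) and weights (kg) of four individuals.
data = [[150.0, 50.0], [160.0, 56.0], [170.0, 66.0], [180.0, 68.0]]
m = mean_vector(data)          # mean vector
S = covariance_matrix(data)    # covariance matrix
R = correlation_matrix(S)      # correlation matrix
```

The mean vector, the standard deviations (square roots of the diagonal of S), and R together carry the same information as the mean vector and S, since each covariance is recovered as the correlation multiplied by the two standard deviations.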

The measurement and analysis of dependence between variables, between
sets of variables, and between variables and sets of variables are fundamental
to multivariate analysis. The multiple correlation coefficient is an extension
of the notion of correlation to the relationship of one variable to a set of
variables. The partial correlation coefficient is a measure of dependence
between two variables when the effects of other correlated variables have
been removed. The various correlation coefficients computed from samples
are used to estimate corresponding correlation coefficients of distributions.
In this book tests of hypotheses of independence are developed. The properties
of the estimators and test procedures are studied for sampling from the
multivariate normal distribution.
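
As a small illustration of removing the effect of another variable, the sketch below uses the standard formula for a first-order partial correlation, which is not stated in this passage and is assumed here; the correlation values are hypothetical.

```python
from math import sqrt

def partial_correlation(r12, r13, r23):
    """First-order partial correlation of variables 1 and 2 given variable 3.

    Uses the standard formula (r12 - r13*r23) / sqrt((1 - r13^2)(1 - r23^2)),
    an assumption of this sketch rather than a formula from the passage.
    """
    return (r12 - r13 * r23) / sqrt((1.0 - r13 ** 2) * (1.0 - r23 ** 2))

# Hypothetical correlations: variables 1 and 2 are each correlated with variable 3.
r12, r13, r23 = 0.5, 0.6, 0.7
r12_given_3 = partial_correlation(r12, r13, r23)
```

Here most of the correlation between variables 1 and 2 is accounted for by their common correlation with variable 3, so the partial correlation is much smaller than 0.5.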

A number of statistical problems arising in multivariate populations are
straightforward analogs of problems arising in univariate populations; the
suitable methods for handling these problems are similarly related. For
example, in the univariate case we may wish to test the hypothesis that the
mean of a variable is zero; in the multivariate case we may wish to test the
hypothesis that the vector of the means of several variables is the zero vector.
The analog of the Student t-test for the first hypothesis is the generalized
$T^2$-test. The analysis of variance of a single variable is adapted to vector
observations; in regression analysis, the dependent quantity may be a vector
variable. A comparison of variances is generalized into a comparison of
covariance matrices.

The test procedures of univariate statistics are generalized to the multivariate
case in such ways that the dependence between variables is taken into
account. These methods may not depend on the coordinate system; that is, 
the procedures may be invariant with respect to linear transformations that 
leave the null hypothesis invariant. In some problems there may be families 
of tests that are invariant; then choices must be made. Optimal properties of 
the tests are considered. 

For some other purposes, however, it may be important to select a
coordinate system so that the variates have desired statistical properties. One
might say that they involve characterizations of inherent properties of normal
distributions and of samples. These are closely related to the algebraic
problems of canonical forms of matrices. An example is finding the normalized
linear combination of variables with maximum or minimum variance
(finding principal components); this amounts to finding a rotation of axes
that carries the covariance matrix to diagonal form. Another example is
characterizing the dependence between two sets of variates (finding canonical
correlations). These problems involve the characteristic roots and vectors
of various matrices. The statistical properties of the corresponding sample 
quantities are treated. 
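
For two variables the rotation of axes that carries the covariance matrix to diagonal form can be written in closed form. The sketch below is an illustration only: the covariance values are hypothetical, and the angle formula is the usual one for a symmetric 2 x 2 matrix, assumed here rather than taken from the text.

```python
from math import atan2, cos, sin

def principal_axes_2x2(s11, s12, s22):
    """Rotation angle and rotated (co)variances for a 2x2 covariance matrix."""
    theta = 0.5 * atan2(2.0 * s12, s11 - s22)
    c, s = cos(theta), sin(theta)
    # Variances of the rotated variates  c*X1 + s*X2  and  -s*X1 + c*X2.
    var_first = c * c * s11 + 2.0 * c * s * s12 + s * s * s22
    var_second = s * s * s11 - 2.0 * c * s * s12 + c * c * s22
    # Covariance between the rotated variates; zero when the axes are principal.
    cov_rotated = (c * c - s * s) * s12 - c * s * (s11 - s22)
    return theta, var_first, var_second, cov_rotated

# Hypothetical covariance matrix [[5, 2], [2, 2]].
theta, var_first, var_second, cov_rotated = principal_axes_2x2(5.0, 2.0, 2.0)
```

With this choice of angle the first rotated variate is the normalized linear combination with maximum variance and the second has minimum variance; the two variances are the characteristic roots of the covariance matrix.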

Some statistical problems arise in models in which means and covariances 
are restricted. Factor analysis may be based on a model with a (population) 
covariance matrix that is the sum of a positive definite diagonal matrix and a 
positive semidefinite matrix of low rank; linear structural relationships may 
have a similar formulation. The simultaneous equations system of econometrics
is another example of a special model.


1.2. THE MULTIVARIATE NORMAL DISTRIBUTION 

The statistical methods treated in this book can be developed and evaluated 
in the context of the multivariate normal distribution, though many of the 
procedures are useful and effective when the distribution sampled is not 
normal. A major reason for basing statistical analysis on the normal distribution
is that this probabilistic model approximates well the distribution of
continuous measurements in many sampled populations. In fact, most of the
methods and theory have been developed to serve statistical analysis of data.
Mathematicians such as Adrian (1808), Laplace (1811), Plana (1813), Gauss
(1823), and Bravais (1846) studied the bivariate normal density. Francis
Galton, the geneticist, introduced the ideas of correlation, regression, and
homoscedasticity in the study of pairs of measurements, one made on a
parent and one on an offspring. [See, e.g., Galton (1889).] He enunciated the
theory of the multivariate normal distribution as a generalization of observed
properties of samples.

Karl Pearson and others carried on the development of the theory and use
of different kinds of correlation coefficients† for studying problems in genetics,
biology, and other fields. R. A. Fisher further developed methods for
agriculture, botany, and anthropology, including the discriminant function for
classification problems. In another direction, analysis of scores on mental
tests led to a theory, including factor analysis, the sampling theory of which is
based on the normal distribution. In these cases, as well as in agricultural
experiments, in engineering problems, in certain economic problems, and in
other fields, the multivariate normal distributions have been found to be
sufficiently close approximations to the populations so that statistical analyses
based on these models are justified.

The univariate normal distribution arises frequently because the effect 
studied is the sum of many independent random effects. Similarly, the 
multivariate normal distribution often occurs because the multiple measurements
are sums of small independent effects. Just as the central limit
theorem leads to the univariate normal distribution for single variables, so 
does the general central limit theorem for several variables lead to the 
multivariate normal distribution. 

Statistical theory based on the normal distribution has the advantage that 
the multivariate methods based on it are extensively developed and can be 
studied in an organized and systematic way. This is due not only to the need 
for such methods because they are of practical use, but also to the fact that
normal theory is amenable to exact mathematical treatment. The suitable 
methods of analysis are mainly based on standard operations of matrix 
algebra; the distributions of many statistics involved can be obtained exactly 
or at least characterized; and in many cases optimum properties of procedures
can be deduced.

The point of view in this book is to state problems of inference in terms of 
the multivariate normal distributions, develop efficient and often optimum 
methods in this context, and evaluate significance and confidence levels in 
these terms. This approach gives coherence and rigor to the exposition, but,
by its very nature, cannot exhaust consideration of multivariate statistical
analysis. The procedures are appropriate to many nonnormal distributions,
but their adequacy may be open to question. Roughly speaking, inferences
about means are robust because of the operation of the central limit
theorem, but inferences about covariances are sensitive to normality, the
variability of sample covariances depending on fourth-order moments.

† For a detailed study of the development of the ideas of correlation, see Walker (1931).

This inflexibility of normal methods with respect to moments of order 
greater than two can be reduced by including a larger class of elliptically 
contoured distributions. In the univariate case the normal distribution is 
determined by the mean and variance; higher-order moments and properties 
such as peakedness and long tails are functions of the mean and variance. 
Similarly, in the multivariate case the means and covariances or the means,
variances, and correlations determine all of the properties of the distribution. 
That limitation is alleviated in one respect by consideration of a broad class 
of elliptically contoured distributions. That class maintains the dependence 
structure, but permits more general peakedness and long tails. This study 
leads to more robust methods. 

The development of computer technology has revolutionized multivariate 
statistics in several respects. As in univariate statistics, modern computers 
permit the evaluation of observed variability and significance of results by 
resampling methods, such as the bootstrap and cross-validation. Such
methodology reduces the reliance on tables of significance points as well as 
eliminates some restrictions of the normal distribution. 

Nonparametric techniques are available when nothing is known about the 
underlying distributions. Space does not permit inclusion of these topics as 
well as other considerations of data analysis, such as treatment of outliers
and transformations of variables to approximate normality and homoscedasticity.

The availability of modern computer facilities makes possible the analysis
of large data sets, and that ability permits the application of multivariate
methods to new areas, such as image analysis, and more effective analysis of
data, such as meteorological. Moreover, new problems of statistical analysis
arise, such as sparseness of parameter or data matrices. Because hardware 
and software development is so explosive and programs require specialized 
knowledge, we are content to make a few remarks here and there about 
computation. Packages of statistical programs are available for most of the 
methods. 



CHAPTER 2 


The Multivariate 
Normal Distribution 


2.1. INTRODUCTION 


In this chapter we discuss the multivariate normal distribution and some of 
its properties. In Section 2.2 are considered the fundamental notions of 
multivariate distributions: the definition by means of multivariate density 
functions, marginal distributions, conditional distributions, expected values,
and moments. In Section 2.3 the multivariate normal distribution is defined;
the parameters are shown to be the means, variances, and covariances or the
means, variances, and correlations of the components of the random vector.
In Section 2.4 it is shown that linear combinations of normal variables are
normally distributed and hence that marginal distributions are normal. In
Section 2.5 we see that conditional distributions are also normal with means
that are linear functions of the conditioning variables; the coefficients are
regression coefficients. The variances, covariances, and correlations (called
partial correlations) are constants. The multiple correlation coefficient is
the maximum correlation between a scalar random variable and a linear
combination of other random variables; it is a measure of association between
one variable and a set of others. The fact that marginal and conditional
distributions of normal distributions are normal makes the treatment
of this family of distributions coherent. In Section 2.6 the characteristic
function, moments, and cumulants are discussed. In Section 2.7 elliptically
contoured distributions are defined; the properties of the normal distribution 
are extended to this larger class of distributions. 


An Introduction to Multivariate Statistical Analysis, Third Edition. By T. W. Anderson
ISBN 0-471-36091-0 Copyright © 2003 John Wiley & Sons, Inc.

2.2. NOTIONS OF MULTIVARIATE DISTRIBUTIONS


2.2.1. Joint Distributions 

In this section we shall consider the notions of joint distributions of several 
variables, derived marginal distributions of subsets of variables, and derived 
conditional distributions. First consider the case of two (real) random 
variables† $X$ and $Y$. Probabilities of events defined in terms of these variables
can be obtained by operations involving the cumulative distribution function
(abbreviated as cdf),

(1)  $$F(x,y) = \Pr\{X \le x,\; Y \le y\},$$

defined for every pair of real numbers $(x, y)$. We are interested in cases
where $F(x,y)$ is absolutely continuous; this means that the following partial
derivative exists almost everywhere:


(2)  $$\frac{\partial^2 F(x,y)}{\partial x\,\partial y} = f(x,y),$$

and

(3)  $$F(x,y) = \int_{-\infty}^{y}\int_{-\infty}^{x} f(u,v)\,du\,dv.$$


The nonnegative function $f(x,y)$ is called the density of $X$ and $Y$. The pair
of random variables $(X,Y)$ defines a random point in a plane. The probability
that $(X,Y)$ falls in a rectangle is

(4)  $$\begin{aligned}
\Pr\{x \le X \le x+\Delta x,\; y \le Y \le y+\Delta y\}
&= F(x+\Delta x, y+\Delta y) - F(x+\Delta x, y) - F(x, y+\Delta y) + F(x,y) \\
&= \int_{y}^{y+\Delta y}\int_{x}^{x+\Delta x} f(u,v)\,du\,dv
\end{aligned}$$

($\Delta x > 0$, $\Delta y > 0$). The probability of the random point $(X,Y)$ falling in any
set $E$ for which the following integral is defined (that is, any measurable set
$E$) is

(5)  $$\Pr\{(X,Y) \in E\} = \iint_E f(x,y)\,dx\,dy.$$

This follows from the definition of the integral [as the limit of sums of the
sort (4)]. If $f(x,y)$ is continuous in both variables, the probability element
$f(x,y)\,\Delta y\,\Delta x$ is approximately the probability that $X$ falls between $x$ and
$x + \Delta x$ and $Y$ falls between $y$ and $y + \Delta y$ since

(6)  $$\Pr\{x \le X \le x+\Delta x,\; y \le Y \le y+\Delta y\} = \int_{y}^{y+\Delta y}\int_{x}^{x+\Delta x} f(u,v)\,du\,dv = f(x_0,y_0)\,\Delta x\,\Delta y$$

for some $x_0, y_0$ ($x \le x_0 \le x+\Delta x$, $y \le y_0 \le y+\Delta y$) by the mean value theorem
of calculus. Since $f(u,v)$ is continuous, (6) is approximately $f(x,y)\,\Delta x\,\Delta y$.
In fact,

(7)  $$\lim_{\substack{\Delta x \to 0 \\ \Delta y \to 0}} \frac{1}{\Delta x\,\Delta y}\,\Bigl|\Pr\{x \le X \le x+\Delta x,\; y \le Y \le y+\Delta y\} - f(x,y)\,\Delta x\,\Delta y\Bigr| = 0.$$

† In Chapter 2 we shall distinguish between random variables and running variables by use of
capital and lowercase letters, respectively. In later chapters we may be unable to hold to this
convention because of other complications of notation.
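
The rectangle formula (4) and the probability element can be checked numerically. The sketch below assumes, purely for illustration, that X and Y are independent unit exponential variables, so that the cdf and density are known in closed form; the function names are ours.

```python
from math import exp

def cdf(x, y):
    """Joint cdf of two independent unit exponentials (an assumed example)."""
    if x <= 0.0 or y <= 0.0:
        return 0.0
    return (1.0 - exp(-x)) * (1.0 - exp(-y))

def density(x, y):
    """Corresponding joint density, exp(-x - y) on the positive quadrant."""
    return exp(-x - y) if x >= 0.0 and y >= 0.0 else 0.0

def rectangle_probability(x, y, dx, dy):
    """Pr{x <= X <= x+dx, y <= Y <= y+dy} from the four-term difference in (4)."""
    return cdf(x + dx, y + dy) - cdf(x + dx, y) - cdf(x, y + dy) + cdf(x, y)

x, y = 0.5, 1.0
exact = rectangle_probability(x, y, 0.01, 0.01)
element = density(x, y) * 0.01 * 0.01      # probability element f(x,y) dx dy
relative_error = abs(exact - element) / exact
```

For a rectangle this small the probability element differs from the exact probability by about one percent, and the discrepancy shrinks as the sides of the rectangle shrink, in line with (7).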

Now we consider the case of $p$ random variables $X_1, X_2, \dots, X_p$. The
cdf is

(8)  $$F(x_1,\dots,x_p) = \Pr\{X_1 \le x_1, \dots, X_p \le x_p\},$$

defined for every set of real numbers $x_1, \dots, x_p$. The density function, if
$F(x_1,\dots,x_p)$ is absolutely continuous, is

(9)  $$\frac{\partial^p F(x_1,\dots,x_p)}{\partial x_1 \cdots \partial x_p} = f(x_1,\dots,x_p)$$

(almost everywhere), and

(10)  $$F(x_1,\dots,x_p) = \int_{-\infty}^{x_p}\cdots\int_{-\infty}^{x_1} f(u_1,\dots,u_p)\,du_1\cdots du_p.$$

The probability of falling in any (measurable) set $R$ in the $p$-dimensional
Euclidean space is

(11)  $$\Pr\{(X_1,\dots,X_p) \in R\} = \int\cdots\int_R f(x_1,\dots,x_p)\,dx_1\cdots dx_p.$$

The probability element $f(x_1,\dots,x_p)\,\Delta x_1\cdots\Delta x_p$ is approximately the probability
$\Pr\{x_1 \le X_1 \le x_1+\Delta x_1, \dots, x_p \le X_p \le x_p+\Delta x_p\}$ if $f(x_1,\dots,x_p)$ is
continuous. The joint moments are defined as†

(12)  $$\mathcal{E} X_1^{h_1}\cdots X_p^{h_p} = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} x_1^{h_1}\cdots x_p^{h_p}\, f(x_1,\dots,x_p)\,dx_1\cdots dx_p.$$


2.2.2. Marginal Distributions

Given the cdf of two random variables $X, Y$ as being $F(x,y)$, the marginal
cdf of $X$ is

(13)  $$\Pr\{X \le x\} = \Pr\{X \le x,\; Y \le \infty\} = F(x,\infty).$$

Let this be $F(x)$. Clearly

(14)  $$F(x) = \int_{-\infty}^{x}\int_{-\infty}^{\infty} f(u,v)\,dv\,du.$$

We call

(15)  $$\int_{-\infty}^{\infty} f(u,v)\,dv = f(u),$$

say, the marginal density of $X$. Then (14) is

(16)  $$F(x) = \int_{-\infty}^{x} f(u)\,du.$$

In a similar fashion we define $G(y)$, the marginal cdf of $Y$, and $g(y)$, the
marginal density of $Y$.
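
The passage from a joint density to a marginal density and marginal cdf, as in (15) and (16), can be sketched numerically. The joint density f(u, v) = u + v on the unit square is an assumed example, and the midpoint rule is one convenient choice of numerical integration, not a method prescribed by the text.

```python
def joint_density(u, v):
    """Assumed example density: f(u, v) = u + v on the unit square, 0 elsewhere."""
    return u + v if 0.0 <= u <= 1.0 and 0.0 <= v <= 1.0 else 0.0

def marginal_density(u, steps=1000):
    """Approximate the integral of f(u, v) over v, as in (15), by the midpoint rule."""
    h = 1.0 / steps
    return sum(joint_density(u, (k + 0.5) * h) for k in range(steps)) * h

def marginal_cdf(x, steps=1000):
    """Approximate F(x), the integral of the marginal density up to x, as in (16)."""
    if x <= 0.0:
        return 0.0
    x = min(x, 1.0)
    h = x / steps
    return sum(marginal_density((k + 0.5) * h, steps=200) for k in range(steps)) * h

f_at_03 = marginal_density(0.3)   # analytically u + 1/2 at u = 0.3
F_at_05 = marginal_cdf(0.5)       # analytically x^2/2 + x/2 at x = 0.5
```

For this density the marginal density of X is u + 1/2 on the unit interval, so the numerical values can be compared with 0.8 and 0.375 respectively.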

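Now we turn to the general case. Given $F(x_1,\dots,x_p)$ as the cdf of
$X_1,\dots,X_p$, we wish to find the marginal cdf of some of $X_1,\dots,X_p$, say, of
$X_1,\dots,X_r$ ($r < p$). It is

(17)  $$\begin{aligned}
\Pr\{X_1 \le x_1, \dots, X_r \le x_r\}
&= \Pr\{X_1 \le x_1, \dots, X_r \le x_r,\; X_{r+1} \le \infty, \dots, X_p \le \infty\} \\
&= F(x_1,\dots,x_r,\infty,\dots,\infty).
\end{aligned}$$

The marginal density of $X_1,\dots,X_r$ is

(18)  $$\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} f(x_1,\dots,x_r,u_{r+1},\dots,u_p)\,du_{r+1}\cdots du_p.$$

The marginal distribution and density of any other subset of $X_1,\dots,X_p$ are
obtained in the obviously similar fashion.

† $\mathcal{E}$ will be used to denote mathematical expectation.

The joint moments of a subset of variates can be computed from the
marginal distribution; for example,

(19)  $$\begin{aligned}
\mathcal{E} X_1^{h_1}\cdots X_r^{h_r}
&= \mathcal{E} X_1^{h_1}\cdots X_r^{h_r} X_{r+1}^{0}\cdots X_p^{0} \\
&= \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} x_1^{h_1}\cdots x_r^{h_r}\, f(x_1,\dots,x_p)\,dx_1\cdots dx_p \\
&= \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} x_1^{h_1}\cdots x_r^{h_r}
\left[\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} f(x_1,\dots,x_p)\,dx_{r+1}\cdots dx_p\right] dx_1\cdots dx_r.
\end{aligned}$$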


2.2.3. Statistical Independence

Two random variables $X, Y$ with cdf $F(x,y)$ are said to be independent if

(20)  $$F(x,y) = F(x)G(y),$$

where $F(x)$ is the marginal cdf of $X$ and $G(y)$ is the marginal cdf of $Y$. This
implies that the density of $X, Y$ is

(21)  $$f(x,y) = \frac{\partial^2 F(x,y)}{\partial x\,\partial y} = \frac{\partial^2 F(x)G(y)}{\partial x\,\partial y} = \frac{dF(x)}{dx}\,\frac{dG(y)}{dy} = f(x)g(y).$$

Conversely, if $f(x,y) = f(x)g(y)$, then

(22)  $$\begin{aligned}
F(x,y) &= \int_{-\infty}^{y}\int_{-\infty}^{x} f(u,v)\,du\,dv = \int_{-\infty}^{y}\int_{-\infty}^{x} f(u)g(v)\,du\,dv \\
&= \int_{-\infty}^{x} f(u)\,du \int_{-\infty}^{y} g(v)\,dv = F(x)G(y).
\end{aligned}$$

Thus an equivalent definition of independence in the case of densities
existing is that $f(x,y) \equiv f(x)g(y)$. To see the implications of statistical
independence, given any $x_1 < x_2$, $y_1 < y_2$, we consider the probability

(23)  $$\begin{aligned}
\Pr\{x_1 \le X \le x_2,\; y_1 \le Y \le y_2\}
&= \int_{y_1}^{y_2}\int_{x_1}^{x_2} f(u,v)\,du\,dv = \int_{x_1}^{x_2} f(u)\,du \int_{y_1}^{y_2} g(v)\,dv \\
&= \Pr\{x_1 \le X \le x_2\}\,\Pr\{y_1 \le Y \le y_2\}.
\end{aligned}$$

The probability of $X$ falling in a given interval and $Y$ falling in a given
interval is the product of the probability of X falling in the interval and the 
probability of Y falling in the other interval. 
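
The factorization in (23) can be illustrated numerically by comparing a density that is a product of marginal densities with one that is not. Both densities on the unit square below are assumed examples, and the midpoint rule is used only as a convenient integration method.

```python
def integrate_rectangle(f, x1, x2, y1, y2, steps=200):
    """Midpoint-rule approximation of the double integral of f over a rectangle."""
    hx = (x2 - x1) / steps
    hy = (y2 - y1) / steps
    total = 0.0
    for i in range(steps):
        u = x1 + (i + 0.5) * hx
        for j in range(steps):
            v = y1 + (j + 0.5) * hy
            total += f(u, v)
    return total * hx * hy

def product_density(u, v):
    """Assumed independent case on the unit square: f(u)g(v) with f(u) = 2u, g(v) = 2v."""
    return 4.0 * u * v

def dependent_density(u, v):
    """Assumed dependent case on the unit square: f(u, v) = u + v."""
    return u + v

x1, x2, y1, y2 = 0.2, 0.6, 0.1, 0.5

# Independent case: the joint probability equals the product of the marginal ones.
joint = integrate_rectangle(product_density, x1, x2, y1, y2)
prob_x = integrate_rectangle(product_density, x1, x2, 0.0, 1.0)
prob_y = integrate_rectangle(product_density, 0.0, 1.0, y1, y2)

# Dependent case: the same comparison fails.
dep_joint = integrate_rectangle(dependent_density, x1, x2, y1, y2)
dep_prob_x = integrate_rectangle(dependent_density, x1, x2, 0.0, 1.0)
dep_prob_y = integrate_rectangle(dependent_density, 0.0, 1.0, y1, y2)
```

In the first case `joint` agrees with `prob_x * prob_y` up to rounding; in the second case the two quantities differ, so that density does not describe independent variables.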

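If the cdf of $X_1,\dots,X_p$ is $F(x_1,\dots,x_p)$, the set of random variables is said
to be mutually independent if

(24)  $$F(x_1,\dots,x_p) = F_1(x_1)\cdots F_p(x_p),$$

where $F_i(x_i)$ is the marginal cdf of $X_i$, $i = 1,\dots,p$. The set $X_1,\dots,X_r$ is said
to be independent of the set $X_{r+1},\dots,X_p$ if

(25)  $$F(x_1,\dots,x_p) = F(x_1,\dots,x_r,\infty,\dots,\infty)\cdot F(\infty,\dots,\infty,x_{r+1},\dots,x_p).$$

One result of independence is that joint moments factor. For example, if
$X_1,\dots,X_p$ are mutually independent, then

(26)  $$\begin{aligned}
\mathcal{E}\bigl(X_1^{h_1}\cdots X_p^{h_p}\bigr)
&= \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} x_1^{h_1}\cdots x_p^{h_p}\, f_1(x_1)\cdots f_p(x_p)\,dx_1\cdots dx_p \\
&= \prod_{i=1}^{p} \int_{-\infty}^{\infty} x_i^{h_i} f_i(x_i)\,dx_i \\
&= \prod_{i=1}^{p} \bigl\{\mathcal{E} X_i^{h_i}\bigr\}.
\end{aligned}$$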


2.2.4. Conditional Distributions

If A and B are two events such that the probability of A and B occurring 
simultaneously is P(AB) and the probability of B occurring is P(B) > 0, 
then the conditional probability of A occurring given that B has occurred is 
P(AB)/P(B). Suppose the event A is X falling in the interval $[x_1, x_2]$ and
the event B is Y falling in $[y_1, y_2]$. Then the conditional probability that X
falls in $[x_1, x_2]$, given that Y falls in $[y_1, y_2]$, is

(27)    $\Pr\{x_1 \le X \le x_2 \mid y_1 \le Y \le y_2\} = \dfrac{\int_{y_1}^{y_2} \int_{x_1}^{x_2} f(u, v)\, du\, dv}{\int_{y_1}^{y_2} g(v)\, dv}.$

Now let $y_1 = y$, $y_2 = y + \Delta y$. Then for a continuous density,

(28)    $\int_{y}^{y + \Delta y} g(v)\, dv = g(y^*)\, \Delta y,$

where $y \le y^* \le y + \Delta y$. Also

(29)    $\int_{y}^{y + \Delta y} f(u, v)\, dv = f[u, y^*(u)]\, \Delta y,$

where $y \le y^*(u) \le y + \Delta y$. Therefore,

(30)    $\Pr\{x_1 \le X \le x_2 \mid y \le Y \le y + \Delta y\} = \int_{x_1}^{x_2} \dfrac{f[u, y^*(u)]}{g(y^*)}\, du.$

It will be noticed that for fixed y and $\Delta y$ (> 0), the integrand of (30) behaves
as a univariate density function. Now for y such that g(y) > 0, we define
$\Pr\{x_1 \le X \le x_2 \mid Y = y\}$, the probability that X lies between $x_1$ and $x_2$, given
that Y is y, as the limit of (30) as $\Delta y \to 0$. Thus

(31)    $\Pr\{x_1 \le X \le x_2 \mid Y = y\} = \int_{x_1}^{x_2} f(u \mid y)\, du,$

where $f(u \mid y) = f(u, y)/g(y)$. For given y, $f(u \mid y)$ is a density function and is
called the conditional density of X given y. We note that if X and Y are
independent, $f(x \mid y) = f(x)$.
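
As a small numerical illustration of (31), the sketch below takes a hypothetical joint density $f(x, y) = x + y$ on the unit square (an assumption made only for this example), forms the conditional density $f(u \mid y) = f(u, y)/g(y)$, and integrates it by the midpoint rule. The function names are likewise illustrative.

```python
def f(x, y):
    """Hypothetical joint density on the unit square: f(x, y) = x + y."""
    return x + y if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0 else 0.0

def g(y, n=2000):
    """Marginal density of Y, the integral of f(u, y) du, by the midpoint rule."""
    h = 1.0 / n
    return sum(f((k + 0.5) * h, y) for k in range(n)) * h

def cond_prob(x1, x2, y, n=2000):
    """Pr{x1 <= X <= x2 | Y = y} as in (31): the integral of f(u, y) / g(y) du."""
    h = (x2 - x1) / n
    gy = g(y)
    return sum(f(x1 + (k + 0.5) * h, y) / gy for k in range(n)) * h

total = cond_prob(0.0, 1.0, 0.3)   # the conditional density integrates to 1
half = cond_prob(0.0, 0.5, 0.3)    # analytically (1/8 + 0.15) / 0.8 = 0.34375
```

For this density $g(y) = y + \tfrac{1}{2}$, so the conditional probabilities can also be worked out by hand and compared with the numerical values.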

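In the general case of $X_1, \dots, X_p$ with cdf $F(x_1, \dots, x_p)$, the conditional
density of $X_1, \dots, X_r$, given $X_{r+1} = x_{r+1}, \dots, X_p = x_p$, is

(32)    $\dfrac{f(x_1, \dots, x_p)}{\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(u_1, \dots, u_r, x_{r+1}, \dots, x_p)\, du_1 \cdots du_r}.$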


For a more general discussion of conditional probabilities, the reader is 
referred to Chung (1974), Kolmogorov (1950), Loève (1977), (1978), and
Neveu (1965). 

2.2.5. Transformation of Variables

Let the density of $X_1, \dots, X_p$ be $f(x_1, \dots, x_p)$. Consider the p real-valued
functions

(33)    $y_i = y_i(x_1, \dots, x_p), \qquad i = 1, \dots, p.$

We assume that the transformation from the x-space to the y-space is
one-to-one;¹ the inverse transformation is

(34)    $x_i = x_i(y_1, \dots, y_p), \qquad i = 1, \dots, p.$

¹More precisely, we assume this is true for the part of the x-space for which $f(x_1, \dots, x_p)$ is
positive.

Let the random variables $Y_1, \dots, Y_p$ be defined by

(35)    $Y_i = y_i(X_1, \dots, X_p), \qquad i = 1, \dots, p.$

Then the density of $Y_1, \dots, Y_p$ is

(36)    $g(y_1, \dots, y_p) = f[x_1(y_1, \dots, y_p), \dots, x_p(y_1, \dots, y_p)]\, J(y_1, \dots, y_p),$


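where $J(y_1, \dots, y_p)$ is the Jacobian

(37)    $J(y_1, \dots, y_p) = \operatorname{mod} \begin{vmatrix} \dfrac{\partial x_1}{\partial y_1} & \dfrac{\partial x_1}{\partial y_2} & \cdots & \dfrac{\partial x_1}{\partial y_p} \\ \dfrac{\partial x_2}{\partial y_1} & \dfrac{\partial x_2}{\partial y_2} & \cdots & \dfrac{\partial x_2}{\partial y_p} \\ \vdots & \vdots & & \vdots \\ \dfrac{\partial x_p}{\partial y_1} & \dfrac{\partial x_p}{\partial y_2} & \cdots & \dfrac{\partial x_p}{\partial y_p} \end{vmatrix}.$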



We assume the derivatives exist, and "mod" means modulus or absolute value
of the expression following it. The probability that $(X_1, \dots, X_p)$ falls in a
region R is given by (11); the probability that $(Y_1, \dots, Y_p)$ falls in a region S is

(38)    $\Pr\{(Y_1, \dots, Y_p) \in S\} = \int \cdots \int_S g(y_1, \dots, y_p)\, dy_1 \cdots dy_p.$

If S is the transform of R, that is, if each point of R transforms by (33) into a
point of S and if each point of S transforms into R by (34), then (11) is equal
to (38) by the usual theory of transformation of multiple integrals. From this
follows the assertion that (36) is the density of $Y_1, \dots, Y_p$.
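
The rule (36) can be tried on a familiar case. The sketch below assumes two independent standard normal variables and the polar transformation $x_1 = r\cos\theta$, $x_2 = r\sin\theta$, whose Jacobian is r (the transformation is one-to-one except at the origin, which has probability 0). It integrates the transformed density over a disc as in (38) and compares the result with the known closed form. The choice of example and all names are illustrative assumptions.

```python
import math

def f(x1, x2):
    """Density of two independent standard normal variables."""
    return math.exp(-0.5 * (x1 * x1 + x2 * x2)) / (2.0 * math.pi)

def g(r, theta):
    """Density of (r, theta) from (36): f at the inverse transform times the Jacobian."""
    x1, x2 = r * math.cos(theta), r * math.sin(theta)   # inverse transformation (34)
    jacobian = abs(r)                                   # mod |d(x1, x2) / d(r, theta)|
    return f(x1, x2) * jacobian

def prob_disc(radius, n_r=400, n_t=60):
    """Pr{(r, theta) in S} as in (38), S = [0, radius] x [0, 2*pi), by the midpoint rule."""
    hr, ht = radius / n_r, 2.0 * math.pi / n_t
    return sum(
        g((i + 0.5) * hr, (j + 0.5) * ht)
        for i in range(n_r) for j in range(n_t)
    ) * hr * ht

approx = prob_disc(2.0)
exact = 1.0 - math.exp(-0.5 * 2.0 ** 2)   # closed form for a disc of radius 2
```

The two values agree to within the error of the midpoint rule, which is what (36) and (38) lead one to expect.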


2.3. THE MULTIVARIATE NORMAL DISTRIBUTION
The univariate normal density function can be written

(1)    $k e^{-\frac{1}{2}\alpha(x - \beta)^2} = k e^{-\frac{1}{2}(x - \beta)\alpha(x - \beta)},$

where $\alpha$ is positive and k is chosen so that the integral of (1) over the entire
x-axis is unity. The density function of a multivariate normal distribution of
$X_1, \dots, X_p$ has an analogous form. The scalar variable x is replaced by a
vector

(2)    $x = \begin{pmatrix} x_1 \\ \vdots \\ x_p \end{pmatrix},$

the scalar constant $\beta$ is replaced by a vector

(3)    $b = \begin{pmatrix} b_1 \\ \vdots \\ b_p \end{pmatrix},$

and the positive constant $\alpha$ is replaced by a positive definite (symmetric)
matrix

(4)    $A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1p} \\ a_{21} & a_{22} & \cdots & a_{2p} \\ \vdots & \vdots & & \vdots \\ a_{p1} & a_{p2} & \cdots & a_{pp} \end{pmatrix}.$


The square $\alpha(x - \beta)^2 = (x - \beta)\alpha(x - \beta)$ is replaced by the quadratic form

(5)    $(x - b)'A(x - b) = \sum_{i,j=1}^{p} a_{ij}(x_i - b_i)(x_j - b_j).$

Thus the density function of a p-variate normal distribution is

(6)    $f(x_1, \dots, x_p) = K e^{-\frac{1}{2}(x - b)'A(x - b)},$

where K (> 0) is chosen so that the integral over the entire p-dimensional
Euclidean space of $x_1, \dots, x_p$ is unity.

Written in matrix notation, the similarity of the multivariate normal 
density (6) to the univariate density (1) is clear. Throughout this book we 
shall use matrix notation and operations. The reader is referred to the
Appendix for a review of matrix theory and for definitions of our notation for 
matrix operations. 

We observe that $f(x_1, \dots, x_p)$ is nonnegative. Since A is positive definite,

(7)    $(x - b)'A(x - b) \ge 0,$

and therefore the density is bounded; that is,

(8)    $f(x_1, \dots, x_p) \le K.$

Now let us determine K so that the integral of (6) over the p-dimensional
space is one. We shall evaluate

(9)    $K^* = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} e^{-\frac{1}{2}(x - b)'A(x - b)}\, dx_p \cdots dx_1.$

We use the fact (see Corollary A.1.6 in the Appendix) that if A is positive
definite, there exists a nonsingular matrix C such that

(10) C'AC = I, 

where I denotes the identity and C' the transpose of C. Let 

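(11)    $x - b = Cy,$

where

(12)    $y = \begin{pmatrix} y_1 \\ \vdots \\ y_p \end{pmatrix}.$

Then

(13)    $(x - b)'A(x - b) = y'C'ACy = y'y.$

The Jacobian of the transformation is

(14)    $J = \operatorname{mod}|C|,$

where mod|C| indicates the absolute value of the determinant of C. Thus (9)
becomes

(15)    $K^* = \operatorname{mod}|C| \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} e^{-\frac{1}{2}y'y}\, dy_p \cdots dy_1.$

We have

(16)    $e^{-\frac{1}{2}y'y} = \exp\left(-\tfrac{1}{2}\sum_{i=1}^{p} y_i^2\right) = \prod_{i=1}^{p} e^{-\frac{1}{2}y_i^2},$

where $\exp(z) = e^z$. We can write (15) as

(17)    $K^* = \operatorname{mod}|C| \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} e^{-\frac{1}{2}y_1^2} \cdots e^{-\frac{1}{2}y_p^2}\, dy_p \cdots dy_1$
        $= \operatorname{mod}|C| \prod_{i=1}^{p} \left\{ \int_{-\infty}^{\infty} e^{-\frac{1}{2}y_i^2}\, dy_i \right\}$
        $= \operatorname{mod}|C| \prod_{i=1}^{p} \left\{ \sqrt{2\pi} \right\}$
        $= \operatorname{mod}|C|\, (2\pi)^{\frac{1}{2}p}$

by virtue of

(18)    $\dfrac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}t^2}\, dt = 1.$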


Corresponding to (10) is the determinantal equation 

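(19)    $|C'| \cdot |A| \cdot |C| = |I|.$

Since

(20)    $|C'| = |C|,$

and since $|I| = 1$, we deduce from (19) that

(21)    $\operatorname{mod}|C| = 1/\sqrt{|A|}.$

Thus

(22)    $K = 1/K^* = \sqrt{|A|}\,(2\pi)^{-\frac{1}{2}p}.$

The normal density function is

(23)    $\dfrac{\sqrt{|A|}}{(2\pi)^{\frac{1}{2}p}}\, e^{-\frac{1}{2}(x - b)'A(x - b)}.$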
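
The constant in (22) can be compared with a direct numerical evaluation of the integral (9) when p = 2. The sketch below is a rough check under stated assumptions: the matrix `A`, the vector `b`, the integration region, and the grid size are all hypothetical choices made for illustration.

```python
import math

# Hypothetical positive definite matrix A and vector b for p = 2 (illustration only).
A = [[2.0, 0.5],
     [0.5, 1.0]]
b = [1.0, -0.5]

def quad_form(x1, x2):
    """(x - b)' A (x - b) for the 2 x 2 matrix A above."""
    d1, d2 = x1 - b[0], x2 - b[1]
    return A[0][0] * d1 * d1 + 2.0 * A[0][1] * d1 * d2 + A[1][1] * d2 * d2

def k_star(half_width=8.0, n=320):
    """Midpoint-rule approximation of K* in (9) over a large square centred at b."""
    h = 2.0 * half_width / n
    total = 0.0
    for i in range(n):
        x1 = b[0] - half_width + (i + 0.5) * h
        for j in range(n):
            x2 = b[1] - half_width + (j + 0.5) * h
            total += math.exp(-0.5 * quad_form(x1, x2))
    return total * h * h

det_A = A[0][0] * A[1][1] - A[0][1] * A[1][0]
K_formula = math.sqrt(det_A) / (2.0 * math.pi)   # (22) with p = 2
K_numeric = 1.0 / k_star()                       # reciprocal of the numerical K*
```

The square is wide enough that the neglected tails are negligible for this A, so the two values of K should agree closely.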


We shall now show the significance of b and A by finding the first and
second moments of $X_1, \dots, X_p$. It will be convenient to consider these
random variables as constituting a random vector

(24)    $X = \begin{pmatrix} X_1 \\ \vdots \\ X_p \end{pmatrix}.$

We shall define generally a random matrix and the expected value of a 
random matrix; a random vector is considered as a special case of a random 
matrix with one column. 


Definition 2.3.1. A random matrix Z is a matrix

(25)    $Z = (Z_{gh}), \qquad g = 1, \dots, m, \quad h = 1, \dots, n,$

of random variables $Z_{11}, \dots, Z_{mn}$.

If the random variables $Z_{11}, \dots, Z_{mn}$ can take on only a finite number of
values, the random matrix Z can be one of a finite number of matrices, say
$Z(1), \dots, Z(q)$. If the probability of $Z = Z(i)$ is $p_i$, then we should like to
define $\mathcal{E}Z$ as $\sum_{i=1}^{q} Z(i)p_i$. Then $\mathcal{E}Z = (\mathcal{E}Z_{gh})$. If the random variables
$Z_{11}, \dots, Z_{mn}$ have a joint density, then by operating with Riemann sums we
can define $\mathcal{E}Z$ as the limit (if the limit exists) of approximating sums of the
kind occurring in the discrete case; then again $\mathcal{E}Z = (\mathcal{E}Z_{gh})$. Therefore, in
general we shall use the following definition:

Definition 2.3.2. The expected value of a random matrix Z is

(26)    $\mathcal{E}Z = (\mathcal{E}Z_{gh}), \qquad g = 1, \dots, m, \quad h = 1, \dots, n.$


In particular if Z is X defined by (24), the expected value

(27)    $\mathcal{E}X = \begin{pmatrix} \mathcal{E}X_1 \\ \vdots \\ \mathcal{E}X_p \end{pmatrix}$

is the mean or mean vector of X. We shall usually denote this mean vector by
$\mu$. If Z is $(X - \mu)(X - \mu)'$, the expected value is

(28)    $\mathcal{C}(X) = \mathcal{E}(X - \mu)(X - \mu)',$

the covariance or covariance matrix of X. The ith diagonal element of this
matrix, $\mathcal{E}(X_i - \mu_i)^2$, is the variance of $X_i$, and the i, jth off-diagonal
element, $\mathcal{E}(X_i - \mu_i)(X_j - \mu_j)$, is the covariance of $X_i$ and $X_j$, $i \ne j$. We shall
usually denote the covariance matrix by $\Sigma$. Note that

(29)    $\mathcal{C}(X) = \mathcal{E}(XX' - \mu X' - X\mu' + \mu\mu') = \mathcal{E}XX' - \mu\mu'.$

The operation of taking the expected value of a random matrix (or vector) 
satisfies certain rules which we can summarize in the following lemma: 

Lemma 2.3.1. If Z is an $m \times n$ random matrix, D is an $l \times m$ real matrix,
E is an $n \times q$ real matrix, and F is an $l \times q$ real matrix, then

(30)    $\mathcal{E}(DZE + F) = D(\mathcal{E}Z)E + F.$





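Proof. The element in the ith row and jth column of $\mathcal{E}(DZE + F)$ is

(31)    $\mathcal{E}\left(\sum_{h,g} d_{ih} Z_{hg} e_{gj} + f_{ij}\right) = \sum_{h,g} d_{ih}(\mathcal{E}Z_{hg})e_{gj} + f_{ij},$

which is the element in the ith row and jth column of $D(\mathcal{E}Z)E + F$.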


Lemma 2.3.2. If Y = DX + f, where X is a random vector, then

(32)    $\mathcal{E}Y = D\mathcal{E}X + f,$

(33)    $\mathcal{C}(Y) = D\mathcal{C}(X)D'.$


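Proof. The first assertion follows directly from Lemma 2.3.1, and the
second from

(34)    $\mathcal{C}(Y) = \mathcal{E}(Y - \mathcal{E}Y)(Y - \mathcal{E}Y)'$
        $= \mathcal{E}[DX + f - (D\mathcal{E}X + f)][DX + f - (D\mathcal{E}X + f)]'$
        $= \mathcal{E}[D(X - \mathcal{E}X)][D(X - \mathcal{E}X)]'$
        $= \mathcal{E}[D(X - \mathcal{E}X)(X - \mathcal{E}X)'D'],$

which yields the right-hand side of (33) by Lemma 2.3.1. ■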
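
Lemma 2.3.2 can be verified exactly for a random vector with finitely many values. In the sketch below the distribution `dist`, the matrix `D`, and the vector `f` are hypothetical, and the helper functions are written only for this illustration.

```python
from fractions import Fraction as Fr

# Hypothetical finite distribution of a 2-component random vector X: (value, probability).
dist = [((0, 0), Fr(1, 4)), ((1, 2), Fr(1, 4)), ((2, 1), Fr(1, 3)), ((3, 5), Fr(1, 6))]
D = [[1, 2], [0, 1], [3, -1]]      # 3 x 2 real matrix (assumed for illustration)
f = [1, -2, 0]                     # 3-component vector

def mean(points):
    """Expected value of a random vector given as [(value, probability), ...]."""
    n = len(points[0][0])
    return [sum(p * x[i] for x, p in points) for i in range(n)]

def cov(points):
    """Covariance matrix E(X - EX)(X - EX)' of the same kind of distribution."""
    m = mean(points)
    n = len(m)
    return [[sum(p * (x[i] - m[i]) * (x[j] - m[j]) for x, p in points)
             for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

# Distribution of Y = DX + f: transform each support point, keep its probability.
dist_y = [(tuple(sum(D[i][k] * x[k] for k in range(2)) + f[i] for i in range(3)), p)
          for x, p in dist]

mean_rhs = [sum(D[i][k] * mean(dist)[k] for k in range(2)) + f[i] for i in range(3)]  # (32)
cov_rhs = matmul(matmul(D, cov(dist)), transpose(D))                                  # (33)
```

Here `mean(dist_y)` and `cov(dist_y)` are computed directly from the distribution of Y, so comparing them with `mean_rhs` and `cov_rhs` checks (32) and (33) without rounding error.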

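When the transformation corresponds to (11), that is, X = CY + b, then
$\mathcal{E}X = C\mathcal{E}Y + b$. By the transformation theory given in Section 2.2, the density
of Y is proportional to (16); that is, it is

(35)    $\dfrac{1}{(2\pi)^{\frac{1}{2}p}}\, e^{-\frac{1}{2}y'y} = \prod_{j=1}^{p} \left\{ \dfrac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}y_j^2} \right\}.$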

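The expected value of the ith component of Y is

(36)    $\mathcal{E}Y_i = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} y_i \prod_{j=1}^{p} \left\{ \dfrac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}y_j^2} \right\} dy_1 \cdots dy_p$
        $= \dfrac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} y_i e^{-\frac{1}{2}y_i^2}\, dy_i \prod_{j \ne i} \left\{ \int_{-\infty}^{\infty} \dfrac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}y_j^2}\, dy_j \right\}$
        $= 0.$

The last equality follows because¹ $y_i e^{-\frac{1}{2}y_i^2}$ is an odd function of $y_i$. Thus
$\mathcal{E}Y = 0$. Therefore, the mean of X, denoted by $\mu$, is

(37)    $\mu = \mathcal{E}X = b.$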

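From (33) we see that $\mathcal{C}(X) = C(\mathcal{E}YY')C'$. The i, jth element of $\mathcal{E}YY'$ is

(38)    $\mathcal{E}Y_iY_j = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} y_i y_j \prod_{h=1}^{p} \left\{ \dfrac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}y_h^2} \right\} dy_1 \cdots dy_p$

because the density of Y is (35). If i = j, we have

(39)    $\mathcal{E}Y_i^2 = \int_{-\infty}^{\infty} y_i^2\, \dfrac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}y_i^2}\, dy_i \prod_{h \ne i} \left\{ \int_{-\infty}^{\infty} \dfrac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}y_h^2}\, dy_h \right\}$
        $= \dfrac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} y_i^2 e^{-\frac{1}{2}y_i^2}\, dy_i$
        $= 1.$

The last equality follows because the next to last expression is the expected
value of the square of a variable normally distributed with mean 0 and
variance 1. If $i \ne j$, (38) becomes

(40)    $\mathcal{E}Y_iY_j = \int_{-\infty}^{\infty} y_i\, \dfrac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}y_i^2}\, dy_i \int_{-\infty}^{\infty} y_j\, \dfrac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}y_j^2}\, dy_j \prod_{h \ne i,j} \left\{ \int_{-\infty}^{\infty} \dfrac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}y_h^2}\, dy_h \right\}$
        $= 0, \qquad i \ne j,$

since the first integration gives 0. We can summarize (39) and (40) as

(41)    $\mathcal{E}YY' = I.$

Thus

(42)    $\mathcal{E}(X - \mu)(X - \mu)' = CIC' = CC'.$

From (10) we obtain $A = (C')^{-1}C^{-1}$ by multiplication by $(C')^{-1}$ on the left
and by $C^{-1}$ on the right. Taking inverses on both sides of the equality
gives us

(43)    $CC' = A^{-1}.$

¹Alternatively, the last equality in (36) follows because the next to last expression is the expected value
of a normally distributed variable with mean 0.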

Thus, the covariance matrix of X is

(44)    $\Sigma = \mathcal{E}(X - \mu)(X - \mu)' = A^{-1}.$

From (43) we see that $\Sigma$ is positive definite. Let us summarize these results.

Theorem 2.3.1. If the density of a p-dimensional random vector X is (23),
then the expected value of X is b and the covariance matrix is $A^{-1}$. Conversely,
given a vector $\mu$ and a positive definite matrix $\Sigma$, there is a multivariate normal
density

(45)    $(2\pi)^{-\frac{1}{2}p}\, |\Sigma|^{-\frac{1}{2}}\, e^{-\frac{1}{2}(x - \mu)'\Sigma^{-1}(x - \mu)}$

such that the expected value of the vector with this density is $\mu$ and the covariance
matrix is $\Sigma$.
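
The construction X = CY + b used in the proof also gives one way to generate observations with a prescribed mean and covariance matrix: any C with $CC' = \Sigma$ will do. The sketch below uses the lower-triangular (Cholesky) factor for p = 2, which is a choice made here and not one the text prescribes. The values of `mu` and `sigma`, the seed, and the sample size are hypothetical.

```python
import math
import random

mu = [1.0, -2.0]                       # hypothetical mean vector
sigma = [[4.0, 1.2], [1.2, 1.0]]       # hypothetical positive definite covariance matrix

# Lower-triangular C with C C' = sigma (Cholesky factor, written out for p = 2).
c11 = math.sqrt(sigma[0][0])
c21 = sigma[1][0] / c11
c22 = math.sqrt(sigma[1][1] - c21 * c21)

def sample(rng):
    """One draw of X = C Y + mu, where Y has independent N(0, 1) components."""
    y1, y2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return (c11 * y1 + mu[0], c21 * y1 + c22 * y2 + mu[1])

rng = random.Random(12345)             # fixed seed so the run is reproducible
draws = [sample(rng) for _ in range(100_000)]
n = len(draws)
m1 = sum(x for x, _ in draws) / n
m2 = sum(y for _, y in draws) / n
s11 = sum((x - m1) ** 2 for x, _ in draws) / n
s22 = sum((y - m2) ** 2 for _, y in draws) / n
s12 = sum((x - m1) * (y - m2) for x, y in draws) / n
```

The sample moments `m1`, `m2`, `s11`, `s12`, `s22` should lie close to the corresponding entries of `mu` and `sigma`, up to sampling error.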


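We shall denote the density (45) as $n(x \mid \mu, \Sigma)$ and the distribution law as
$N(\mu, \Sigma)$.

The ith diagonal element of the covariance matrix, $\sigma_{ii}$, is the variance of
the ith component of X; we may sometimes denote this by $\sigma_i^2$. The
correlation coefficient between $X_i$ and $X_j$ is defined as

(46)    $\rho_{ij} = \dfrac{\sigma_{ij}}{\sqrt{\sigma_{ii}}\sqrt{\sigma_{jj}}} = \dfrac{\sigma_{ij}}{\sigma_i\sigma_j}.$

This measure of association is symmetric in $X_i$ and $X_j$: $\rho_{ij} = \rho_{ji}$. Since

(47)    $\begin{pmatrix} \sigma_{ii} & \sigma_{ij} \\ \sigma_{ji} & \sigma_{jj} \end{pmatrix} = \begin{pmatrix} \sigma_i^2 & \sigma_i\sigma_j\rho_{ij} \\ \sigma_i\sigma_j\rho_{ij} & \sigma_j^2 \end{pmatrix}$

is positive definite (Corollary A.1.3 of the Appendix), the determinant

(48)    $\begin{vmatrix} \sigma_i^2 & \sigma_i\sigma_j\rho_{ij} \\ \sigma_i\sigma_j\rho_{ij} & \sigma_j^2 \end{vmatrix} = \sigma_i^2\sigma_j^2(1 - \rho_{ij}^2)$

is positive. Therefore, $-1 < \rho_{ij} < 1$. (For singular distributions, see Section
2.4.) The multivariate normal density can be parametrized by the means $\mu_i$,
$i = 1, \dots, p$, the variances $\sigma_i^2$, $i = 1, \dots, p$, and the correlations $\rho_{ij}$, $i < j$,
$i, j = 1, \dots, p$.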
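As a special case of the preceding theory, we consider the bivariate normal
distribution. The mean vector is

(49)    $\mathcal{E}\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix};$

the covariance matrix may be written

(50)    $\Sigma = \mathcal{E}\begin{pmatrix} (X_1 - \mu_1)^2 & (X_1 - \mu_1)(X_2 - \mu_2) \\ (X_2 - \mu_2)(X_1 - \mu_1) & (X_2 - \mu_2)^2 \end{pmatrix} = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix} = \begin{pmatrix} \sigma_1^2 & \sigma_1\sigma_2\rho \\ \sigma_1\sigma_2\rho & \sigma_2^2 \end{pmatrix},$

where $\sigma_1^2$ is the variance of $X_1$, $\sigma_2^2$ the variance of $X_2$, and $\rho$ the
correlation between $X_1$ and $X_2$. The inverse of (50) is

(51)    $\Sigma^{-1} = \dfrac{1}{1 - \rho^2} \begin{pmatrix} \dfrac{1}{\sigma_1^2} & -\dfrac{\rho}{\sigma_1\sigma_2} \\ -\dfrac{\rho}{\sigma_1\sigma_2} & \dfrac{1}{\sigma_2^2} \end{pmatrix}.$

The density function of $X_1$ and $X_2$ is

(52)    $\dfrac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - \rho^2}} \exp\left\{ -\dfrac{1}{2(1 - \rho^2)} \left[ \dfrac{(x_1 - \mu_1)^2}{\sigma_1^2} - 2\rho\, \dfrac{(x_1 - \mu_1)(x_2 - \mu_2)}{\sigma_1\sigma_2} + \dfrac{(x_2 - \mu_2)^2}{\sigma_2^2} \right] \right\}.$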
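
To see that (52) is simply (45) written out for p = 2, the two forms can be evaluated at the same point. In the sketch below the function names and the numerical arguments are assumptions chosen for illustration; the inverse and determinant follow (51) and (48).

```python
import math

def density_52(x1, x2, mu1, mu2, s1, s2, rho):
    """Bivariate normal density written out as in (52)."""
    z1, z2 = (x1 - mu1) / s1, (x2 - mu2) / s2
    q = (z1 * z1 - 2.0 * rho * z1 * z2 + z2 * z2) / (1.0 - rho * rho)
    return math.exp(-0.5 * q) / (2.0 * math.pi * s1 * s2 * math.sqrt(1.0 - rho * rho))

def density_45(x1, x2, mu1, mu2, s1, s2, rho):
    """The same density from the matrix form (45), using the inverse (51)."""
    c = 1.0 / (1.0 - rho * rho)
    inv = [[c / (s1 * s1), -c * rho / (s1 * s2)],
           [-c * rho / (s1 * s2), c / (s2 * s2)]]
    det = s1 * s1 * s2 * s2 * (1.0 - rho * rho)          # |Sigma|, compare (48)
    d1, d2 = x1 - mu1, x2 - mu2
    q = inv[0][0] * d1 * d1 + 2.0 * inv[0][1] * d1 * d2 + inv[1][1] * d2 * d2
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))

args = (0.7, -1.1, 0.5, -1.0, 2.0, 0.5, 0.6)   # arbitrary point and parameters
a, b = density_52(*args), density_45(*args)
```

The values `a` and `b` should coincide up to floating-point rounding for any admissible parameters with $|\rho| < 1$.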


Theorem 2.3.2. The correlation coefficient $\rho$ of any bivariate distribution is
invariant with respect to transformations $X_i^* = b_iX_i + c_i$, $b_i > 0$, $i = 1, 2$. Every
function of the parameters of a bivariate normal distribution that is invariant with
respect to such transformations is a function of $\rho$.


Proof. The variance of $X_i^*$ is $b_i^2\sigma_i^2$, $i = 1, 2$, and the covariance of $X_1^*$ and
$X_2^*$ is $b_1b_2\sigma_1\sigma_2\rho$ by Lemma 2.3.2. Insertion of these values into the
definition of the correlation between $X_1^*$ and $X_2^*$ shows that it is $\rho$. If
$f(\mu_1, \mu_2, \sigma_1, \sigma_2, \rho)$ is invariant with respect to such transformations, it must
be $f(0, 0, 1, 1, \rho)$ by choice of $b_i = 1/\sigma_i$ and $c_i = -\mu_i/\sigma_i$, $i = 1, 2$. ■
The correlation coefficient $\rho$ is the natural measure of association between
$X_1$ and $X_2$. Any function of the parameters of the bivariate normal distribution
that is independent of the scale and location parameters is a function of
$\rho$. The standardized variable (or standard score) is $Y_i = (X_i - \mu_i)/\sigma_i$. The
mean squared difference between the two standardized variables is

(53)    $\mathcal{E}(Y_1 - Y_2)^2 = 2(1 - \rho).$

The smaller (53) is (that is, the larger $\rho$ is), the more similar $Y_1$ and $Y_2$ are. If
$\rho > 0$, $X_1$ and $X_2$ tend to be positively related, and if $\rho < 0$, they tend to be
negatively related. If $\rho = 0$, the density (52) is the product of the marginal
densities of $X_1$ and $X_2$; hence $X_1$ and $X_2$ are independent.
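
The value in (53) can be checked directly. Each standardized variable has mean 0 and variance 1, and $\mathcal{E}Y_1Y_2 = \rho$ by the definition of the correlation, so expanding the square gives:

```latex
\mathcal{E}(Y_1 - Y_2)^2
  = \mathcal{E}Y_1^2 - 2\,\mathcal{E}Y_1Y_2 + \mathcal{E}Y_2^2
  = 1 - 2\rho + 1
  = 2(1 - \rho).
```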

It will be noticed that the density function (45) is constant on ellipsoids

(54)    $(x - \mu)'\Sigma^{-1}(x - \mu) = c$

for every positive value of c in a p-dimensional Euclidean space. The center
of each ellipsoid is at the point $\mu$. The shape and orientation of the ellipsoid
are determined by $\Sigma$, and the size (given $\Sigma$) is determined by c. Because (54)
is a sphere if $\Sigma = \sigma^2 I$, $n(x \mid \mu, \sigma^2 I)$ is known as a spherical normal density.

Let us consider in detail the bivariate case of the density (52). We
transform coordinates by $(x_i - \mu_i)/\sigma_i = y_i$, $i = 1, 2$, so that the centers of the
loci of constant density are at the origin. These loci are defined by

(55)    $\dfrac{1}{1 - \rho^2}\,(y_1^2 - 2\rho y_1y_2 + y_2^2) = c.$

The intercepts on the $y_1$-axis and $y_2$-axis are equal. If $\rho > 0$, the major axis of
the ellipse is along the 45° line with a length of $2\sqrt{c(1 + \rho)}$, and the minor
axis has a length of $2\sqrt{c(1 - \rho)}$. If $\rho < 0$, the major axis is along the 135° line
with a length of $2\sqrt{c(1 - \rho)}$, and the minor axis has a length of $2\sqrt{c(1 + \rho)}$.
The value of $\rho$ determines the ratio of these lengths. In this bivariate case we
can think of the density function as a surface above the plane. The contours
of equal density are contours of equal altitude on a topographical map; they
indicate the shape of the hill (or probability surface). If $\rho > 0$, the hill will
tend to run along a line with a positive slope; most of the hill will be in the
first and third quadrants. When we transform back to $x_i = \sigma_iy_i + \mu_i$, we
expand each contour by a factor of $\sigma_i$ in the direction of the ith axis and
shift the center to $(\mu_1, \mu_2)$.
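
The stated axis lengths can be confirmed by checking that the end points of the two axes satisfy (55). The sketch below does this for one hypothetical pair of values of c and $\rho$; the function names are illustrative.

```python
import math

def locus_value(y1, y2, rho):
    """Left-hand side of (55) at the point (y1, y2)."""
    return (y1 * y1 - 2.0 * rho * y1 * y2 + y2 * y2) / (1.0 - rho * rho)

def axis_lengths(c, rho):
    """Full lengths of the axes along the 45-degree and 135-degree lines."""
    return 2.0 * math.sqrt(c * (1.0 + rho)), 2.0 * math.sqrt(c * (1.0 - rho))

c, rho = 3.0, 0.6                       # hypothetical contour level and correlation
len45, len135 = axis_lengths(c, rho)

# End points of the two axes: half the length out from the origin along each line.
t45 = len45 / 2.0 / math.sqrt(2.0)
t135 = len135 / 2.0 / math.sqrt(2.0)
on_45 = locus_value(t45, t45, rho)      # should reproduce c
on_135 = locus_value(-t135, t135, rho)  # should reproduce c
```

With $\rho > 0$ the axis along the 45° line is the longer one, in agreement with the description above.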
The numerical values of the cdf of the univariate normal variable are
obtained from tables found in most statistical texts. The numerical values of

(56)    $F(x_1, x_2) = \Pr\{X_1 \le x_1, X_2 \le x_2\}$
        $= \int_{-\infty}^{y_1} \int_{-\infty}^{y_2} \dfrac{1}{2\pi\sqrt{1 - \rho^2}} \exp\left\{ -\dfrac{1}{2(1 - \rho^2)}\,(u^2 - 2\rho uv + v^2) \right\} dv\, du,$

where $y_1 = (x_1 - \mu_1)/\sigma_1$ and $y_2 = (x_2 - \mu_2)/\sigma_2$, can be found in Pearson
(1931). An extensive table has been given by the National Bureau of Standards
(1959). A bibliography of such tables has been given by Gupta (1963).
Pearson has also shown that

(57)    $F(x_1, x_2) = \sum_{j=0}^{\infty} \rho^j \tau_j(y_1)\tau_j(y_2),$

where the so-called tetrachoric functions $\tau_j(y)$ are tabulated in Pearson (1930)
up to $\tau_{19}(y)$. Harris and Soms (1980) have studied generalizations of (57).


2.4. THE DISTRIBUTION OF LINEAR COMBINATIONS OF 
NORMALLY DISTRIBUTED VARIATES; INDEPENDENCE
OF VARIATES; MARGINAL DISTRIBUTIONS 

One of the reasons that the study of normal multivariate distributions is so 
useful is that marginal distributions and conditional distributions derived 
from multivariate normal distributions are also normal distributions. Moreover,
linear combinations of multivariate normal variates are again normally
distributed. First we shall show that if we make a nonsingular linear transformation
of a vector whose components have a joint distribution with a normal
density, we obtain a vector whose components are jointly distributed with a 
normal density. 

Theorem 2.4.1. Let X (with p components) be distributed according to
$N(\mu, \Sigma)$. Then

(1)    $Y = CX$

is distributed according to $N(C\mu, C\Sigma C')$ for C nonsingular.

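Proof. The density of Y is obtained from the density of X, $n(x \mid \mu, \Sigma)$, by
replacing x by

(2)    $x = C^{-1}y,$

and multiplying by the Jacobian of the transformation (2),

(3)    $\operatorname{mod}|C^{-1}| = \dfrac{1}{\operatorname{mod}|C|} = \sqrt{\dfrac{|\Sigma|}{|C| \cdot |\Sigma| \cdot |C'|}} = \dfrac{|\Sigma|^{\frac{1}{2}}}{|C\Sigma C'|^{\frac{1}{2}}}.$

The quadratic form in the exponent of $n(x \mid \mu, \Sigma)$ is

(4)    $Q = (x - \mu)'\Sigma^{-1}(x - \mu).$

The transformation (2) carries Q into

(5)    $Q = (C^{-1}y - \mu)'\Sigma^{-1}(C^{-1}y - \mu)$
       $= (C^{-1}y - C^{-1}C\mu)'\Sigma^{-1}(C^{-1}y - C^{-1}C\mu)$
       $= [C^{-1}(y - C\mu)]'\Sigma^{-1}[C^{-1}(y - C\mu)]$
       $= (y - C\mu)'(C^{-1})'\Sigma^{-1}C^{-1}(y - C\mu)$
       $= (y - C\mu)'(C\Sigma C')^{-1}(y - C\mu)$

since $(C^{-1})' = (C')^{-1}$ by virtue of transposition of $CC^{-1} = I$. Thus the
density of Y is

(6)    $n(C^{-1}y \mid \mu, \Sigma)\operatorname{mod}|C|^{-1} = (2\pi)^{-\frac{1}{2}p}\, |C\Sigma C'|^{-\frac{1}{2}} \exp\left[-\tfrac{1}{2}(y - C\mu)'(C\Sigma C')^{-1}(y - C\mu)\right]$
       $= n(y \mid C\mu, C\Sigma C').$ ■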
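
The identity (6) can be confirmed numerically for p = 2 by evaluating both sides at one point. The sketch below is only an illustration: `mu`, `sigma`, `C`, and the point `y` are hypothetical, and the small matrix helpers are written for the 2 × 2 case only.

```python
import math

def matmul(a, b):
    """Product of two 2 x 2 matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def transpose(a):
    return [[a[j][i] for j in range(2)] for i in range(2)]

def det(a):
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]

def inverse(a):
    d = det(a)
    return [[a[1][1] / d, -a[0][1] / d], [-a[1][0] / d, a[0][0] / d]]

def normal_density(x, mu, sigma):
    """n(x | mu, sigma) for p = 2, following (45) of Section 2.3."""
    inv = inverse(sigma)
    d = [x[0] - mu[0], x[1] - mu[1]]
    q = sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det(sigma)))

mu = [1.0, 2.0]                         # hypothetical parameters
sigma = [[2.0, 0.3], [0.3, 1.0]]
C = [[1.0, 2.0], [-1.0, 0.5]]           # nonsingular: determinant 2.5

y = [0.4, -0.9]                         # arbitrary point in the y-space
C_inv = inverse(C)
x = [C_inv[0][0] * y[0] + C_inv[0][1] * y[1], C_inv[1][0] * y[0] + C_inv[1][1] * y[1]]

lhs = normal_density(x, mu, sigma) / abs(det(C))      # n(C^{-1} y | mu, Sigma) mod|C|^{-1}
C_mu = [C[0][0] * mu[0] + C[0][1] * mu[1], C[1][0] * mu[0] + C[1][1] * mu[1]]
rhs = normal_density(y, C_mu, matmul(matmul(C, sigma), transpose(C)))   # n(y | C mu, C Sigma C')
```

Agreement of `lhs` and `rhs` at arbitrary points is what Theorem 2.4.1 asserts for the densities.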

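Now let us consider two sets of random variables $X_1, \dots, X_q$ and
$X_{q+1}, \dots, X_p$ forming the vectors

(7)    $X^{(1)} = \begin{pmatrix} X_1 \\ \vdots \\ X_q \end{pmatrix}, \qquad X^{(2)} = \begin{pmatrix} X_{q+1} \\ \vdots \\ X_p \end{pmatrix}.$

These variables form the random vector

(8)    $X = \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix} = \begin{pmatrix} X_1 \\ \vdots \\ X_p \end{pmatrix}.$

Now let us assume that the p variates have a joint normal distribution with
mean vectors

(9)    $\mathcal{E}X^{(1)} = \mu^{(1)}, \qquad \mathcal{E}X^{(2)} = \mu^{(2)},$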
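and covariance matrices

(10)    $\mathcal{E}(X^{(1)} - \mu^{(1)})(X^{(1)} - \mu^{(1)})' = \Sigma_{11},$

(11)    $\mathcal{E}(X^{(2)} - \mu^{(2)})(X^{(2)} - \mu^{(2)})' = \Sigma_{22},$

(12)    $\mathcal{E}(X^{(1)} - \mu^{(1)})(X^{(2)} - \mu^{(2)})' = \Sigma_{12}.$

We say that the random vector X has been partitioned in (8) into subvectors,
that

(13)    $\mu = \begin{pmatrix} \mu^{(1)} \\ \mu^{(2)} \end{pmatrix}$

has been partitioned similarly into subvectors, and that

(14)    $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$

has been partitioned similarly into submatrices. Here $\Sigma_{21} = \Sigma_{12}'$. (See Appendix,
Section A.3.)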

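We shall show that $X^{(1)}$ and $X^{(2)}$ are independently normally distributed
if $\Sigma_{12} = \Sigma_{21}' = 0$. Then

(15)    $\Sigma = \begin{pmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{pmatrix}.$

Its inverse is

(16)    $\Sigma^{-1} = \begin{pmatrix} \Sigma_{11}^{-1} & 0 \\ 0 & \Sigma_{22}^{-1} \end{pmatrix}.$

Thus the quadratic form in the exponent of $n(x \mid \mu, \Sigma)$ is

(17)    $Q = (x - \mu)'\Sigma^{-1}(x - \mu)$
        $= \left[(x^{(1)} - \mu^{(1)})',\ (x^{(2)} - \mu^{(2)})'\right] \begin{pmatrix} \Sigma_{11}^{-1} & 0 \\ 0 & \Sigma_{22}^{-1} \end{pmatrix} \begin{pmatrix} x^{(1)} - \mu^{(1)} \\ x^{(2)} - \mu^{(2)} \end{pmatrix}$
        $= (x^{(1)} - \mu^{(1)})'\Sigma_{11}^{-1}(x^{(1)} - \mu^{(1)}) + (x^{(2)} - \mu^{(2)})'\Sigma_{22}^{-1}(x^{(2)} - \mu^{(2)})$
        $= Q_1 + Q_2,$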
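say, where

(18)    $Q_1 = (x^{(1)} - \mu^{(1)})'\Sigma_{11}^{-1}(x^{(1)} - \mu^{(1)}),$
        $Q_2 = (x^{(2)} - \mu^{(2)})'\Sigma_{22}^{-1}(x^{(2)} - \mu^{(2)}).$

Also we note that $|\Sigma| = |\Sigma_{11}| \cdot |\Sigma_{22}|$. The density of X can be written

(19)    $n(x \mid \mu, \Sigma) = \dfrac{1}{(2\pi)^{\frac{1}{2}p}\, |\Sigma|^{\frac{1}{2}}}\, e^{-\frac{1}{2}Q}$
        $= \dfrac{1}{(2\pi)^{\frac{1}{2}q}\, |\Sigma_{11}|^{\frac{1}{2}}}\, e^{-\frac{1}{2}Q_1} \cdot \dfrac{1}{(2\pi)^{\frac{1}{2}(p - q)}\, |\Sigma_{22}|^{\frac{1}{2}}}\, e^{-\frac{1}{2}Q_2}$
        $= n(x^{(1)} \mid \mu^{(1)}, \Sigma_{11})\, n(x^{(2)} \mid \mu^{(2)}, \Sigma_{22}).$

The marginal density of $X^{(1)}$ is given by the integral

(20)    $\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} n(x \mid \mu, \Sigma)\, dx_{q+1} \cdots dx_p$
        $= n(x^{(1)} \mid \mu^{(1)}, \Sigma_{11}) \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} n(x^{(2)} \mid \mu^{(2)}, \Sigma_{22})\, dx_{q+1} \cdots dx_p$
        $= n(x^{(1)} \mid \mu^{(1)}, \Sigma_{11}).$

Thus the marginal distribution of $X^{(1)}$ is $N(\mu^{(1)}, \Sigma_{11})$; similarly the marginal
distribution of $X^{(2)}$ is $N(\mu^{(2)}, \Sigma_{22})$. Thus the joint density of $X_1, \dots, X_p$ is the
product of the marginal density of $X_1, \dots, X_q$ and the marginal density of
$X_{q+1}, \dots, X_p$, and therefore the two sets of variates are independent. Since
the numbering of variates can always be done so that $X^{(1)}$ consists of any
subset of the variates, we have proved the sufficiency in the following
theorem: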

Theorem 2.4.2. If $X_1, \dots, X_p$ have a joint normal distribution, a necessary
and sufficient condition for one subset of the random variables and the subset
consisting of the remaining variables to be independent is that each covariance of
a variable from one set and a variable from the other set is 0.
The necessity follows from the fact that if $X_i$ is from one set and $X_j$ from
the other, then for any density (see Section 2.2.3)

(21)    $\sigma_{ij} = \mathcal{E}(X_i - \mu_i)(X_j - \mu_j)$
        $= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} (x_i - \mu_i)(x_j - \mu_j) f(x_1, \dots, x_q) f(x_{q+1}, \dots, x_p)\, dx_1 \cdots dx_p$
        $= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} (x_i - \mu_i) f(x_1, \dots, x_q)\, dx_1 \cdots dx_q \cdot \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} (x_j - \mu_j) f(x_{q+1}, \dots, x_p)\, dx_{q+1} \cdots dx_p$
        $= 0.$

Since $\sigma_{ij} = \sigma_i\sigma_j\rho_{ij}$, and $\sigma_i, \sigma_j \ne 0$ (we tacitly assume that $\Sigma$ is nonsingular),
the condition $\sigma_{ij} = 0$ is equivalent to $\rho_{ij} = 0$. Thus if one set of variates is
uncorrelated with the remaining variates, the two sets are independent. It
should be emphasized that the implication of independence by lack of
correlation depends on the assumption of normality, but the converse is
always true.
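
That lack of correlation does not imply independence without normality can be seen in a simple non-normal case. The sketch below uses a standard textbook-style counterexample, chosen here for illustration and not taken from the text: X uniform on {-1, 0, 1} and Y = X².

```python
from fractions import Fraction as Fr

# Hypothetical non-normal example: X uniform on {-1, 0, 1} and Y = X^2.
joint = {(-1, 1): Fr(1, 3), (0, 0): Fr(1, 3), (1, 1): Fr(1, 3)}

def expect(func):
    """Expected value of func(x, y) under the joint distribution above."""
    return sum(p * func(x, y) for (x, y), p in joint.items())

mean_x = expect(lambda x, y: x)
mean_y = expect(lambda x, y: y)
covariance = expect(lambda x, y: (x - mean_x) * (y - mean_y))

# Independence would require Pr{X = 0, Y = 0} = Pr{X = 0} Pr{Y = 0}.
p_joint = joint[(0, 0)]
p_x0 = sum(p for (x, _), p in joint.items() if x == 0)
p_y0 = sum(p for (_, y), p in joint.items() if y == 0)
```

The covariance is exactly 0, yet the joint probability of (0, 0) differs from the product of the marginal probabilities, so the two variables are dependent.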

Let us consider the special case of the bivariate normal distribution. Then
$X^{(1)} = X_1$, $X^{(2)} = X_2$, $\mu^{(1)} = \mu_1$, $\mu^{(2)} = \mu_2$, $\Sigma_{11} = \sigma_{11} = \sigma_1^2$, $\Sigma_{22} = \sigma_{22} = \sigma_2^2$,
and $\Sigma_{12} = \Sigma_{21} = \sigma_{12} = \sigma_1\sigma_2\rho_{12}$. Thus if $X_1$ and $X_2$ have a bivariate normal
distribution, they are independent if and only if they are uncorrelated. If they
are uncorrelated, the marginal distribution of $X_i$ is normal with mean $\mu_i$ and
variance $\sigma_i^2$. The above discussion also proves the following corollary:

Corollary 2.4.1. If X is distributed according to $N(\mu, \Sigma)$ and if a set of
components of X is uncorrelated with the other components, the marginal
distribution of the set is multivariate normal with means, variances, and covariances
obtained by taking the corresponding components of $\mu$ and $\Sigma$, respectively.

Now let us show that the corollary holds even if the two sets are not
independent. We partition X, $\mu$, and $\Sigma$ as before. We shall make a
nonsingular linear transformation to subvectors

(22)    $Y^{(1)} = X^{(1)} + BX^{(2)},$

(23)    $Y^{(2)} = X^{(2)},$

choosing B so that the components of $Y^{(1)}$ are uncorrelated with the






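components of $Y^{(2)} = X^{(2)}$. The matrix B must satisfy the equation

(24)    $0 = \mathcal{E}(Y^{(1)} - \mathcal{E}Y^{(1)})(Y^{(2)} - \mathcal{E}Y^{(2)})'$
        $= \mathcal{E}(X^{(1)} + BX^{(2)} - \mathcal{E}X^{(1)} - B\mathcal{E}X^{(2)})(X^{(2)} - \mathcal{E}X^{(2)})'$
        $= \mathcal{E}[(X^{(1)} - \mathcal{E}X^{(1)}) + B(X^{(2)} - \mathcal{E}X^{(2)})](X^{(2)} - \mathcal{E}X^{(2)})'$
        $= \Sigma_{12} + B\Sigma_{22}.$

Thus $B = -\Sigma_{12}\Sigma_{22}^{-1}$ and

(25)    $Y^{(1)} = X^{(1)} - \Sigma_{12}\Sigma_{22}^{-1}X^{(2)}.$

The vector

(26)    $\begin{pmatrix} Y^{(1)} \\ Y^{(2)} \end{pmatrix} = Y = \begin{pmatrix} I & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0 & I \end{pmatrix} X$

is a nonsingular transform of X, and therefore has a normal distribution with

(27)    $\mathcal{E}\begin{pmatrix} Y^{(1)} \\ Y^{(2)} \end{pmatrix} = \begin{pmatrix} I & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0 & I \end{pmatrix} \begin{pmatrix} \mu^{(1)} \\ \mu^{(2)} \end{pmatrix} = \begin{pmatrix} \mu^{(1)} - \Sigma_{12}\Sigma_{22}^{-1}\mu^{(2)} \\ \mu^{(2)} \end{pmatrix} = \begin{pmatrix} \nu^{(1)} \\ \nu^{(2)} \end{pmatrix} = \nu,$

say, and

(28)    $\mathcal{C}(Y) = \mathcal{E}(Y - \nu)(Y - \nu)'$
        $= \begin{pmatrix} \mathcal{E}(Y^{(1)} - \nu^{(1)})(Y^{(1)} - \nu^{(1)})' & \mathcal{E}(Y^{(1)} - \nu^{(1)})(Y^{(2)} - \nu^{(2)})' \\ \mathcal{E}(Y^{(2)} - \nu^{(2)})(Y^{(1)} - \nu^{(1)})' & \mathcal{E}(Y^{(2)} - \nu^{(2)})(Y^{(2)} - \nu^{(2)})' \end{pmatrix}$
        $= \begin{pmatrix} \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0 \\ 0 & \Sigma_{22} \end{pmatrix}$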
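since

(29)    $\mathcal{E}(Y^{(1)} - \nu^{(1)})(Y^{(1)} - \nu^{(1)})'$
        $= \mathcal{E}[(X^{(1)} - \mu^{(1)}) - \Sigma_{12}\Sigma_{22}^{-1}(X^{(2)} - \mu^{(2)})][(X^{(1)} - \mu^{(1)}) - \Sigma_{12}\Sigma_{22}^{-1}(X^{(2)} - \mu^{(2)})]'$
        $= \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} + \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{22}\Sigma_{22}^{-1}\Sigma_{21}$
        $= \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$

Thus $Y^{(1)}$ and $Y^{(2)}$ are independent, and by Corollary 2.4.1 $X^{(2)} = Y^{(2)}$ has
the marginal distribution $N(\mu^{(2)}, \Sigma_{22})$. Because the numbering of the components
of X is arbitrary, we can state the following theorem: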

Theorem 2.4.3. If X is distributed according to $N(\mu, \Sigma)$, the marginal
distribution of any set of components of X is multivariate normal with means,
variances, and covariances obtained by taking the corresponding components of
$\mu$ and $\Sigma$, respectively.
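
The choice $B = -\Sigma_{12}\Sigma_{22}^{-1}$ in the argument leading to this theorem can be checked exactly for a small case. The sketch below takes a hypothetical 3 × 3 covariance matrix with q = 1 and verifies that the transformed covariance matrix has the block form (28). The matrix and helper names are assumptions made for illustration.

```python
from fractions import Fraction as Fr

# Hypothetical 3 x 3 covariance matrix, partitioned with q = 1 (first component alone).
sigma = [[Fr(4), Fr(1), Fr(2)],
         [Fr(1), Fr(3), Fr(1)],
         [Fr(2), Fr(1), Fr(5)]]

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def inverse_2x2(m):
    d = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]

s11 = [[sigma[0][0]]]
s12 = [sigma[0][1:]]                                   # 1 x 2 block Sigma_12
s22 = [row[1:] for row in sigma[1:]]                   # 2 x 2 block Sigma_22
B = [[-v for v in matmul(s12, inverse_2x2(s22))[0]]]   # B = -Sigma_12 Sigma_22^{-1}

# Matrix of the transformation (26): rows (I, B) and (0, I).
T = [[Fr(1), B[0][0], B[0][1]],
     [Fr(0), Fr(1), Fr(0)],
     [Fr(0), Fr(0), Fr(1)]]
cov_y = matmul(matmul(T, sigma), transpose(T))         # covariance of Y = T X, by (33) of Section 2.3

# Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21, the upper-left block in (28).
schur = s11[0][0] - matmul(matmul(s12, inverse_2x2(s22)), transpose(s12))[0][0]
```

The off-diagonal blocks of `cov_y` vanish and its upper-left entry equals `schur`, while the lower-right block is $\Sigma_{22}$ unchanged.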

Now consider any transformation

(30)    $Z = DX,$

where Z has q components and D is a $q \times p$ real matrix. The expected value
of Z is

(31)    $\mathcal{E}Z = D\mu,$

and the covariance matrix is

(32)    $\mathcal{E}(Z - D\mu)(Z - D\mu)' = D\Sigma D'.$

The case q = p and D nonsingular has been treated above. If $q \le p$ and D is
of rank q, we can find a $(p - q) \times p$ matrix E such that

(33)    $\begin{pmatrix} Z \\ W \end{pmatrix} = \begin{pmatrix} D \\ E \end{pmatrix} X$

is a nonsingular transformation. (See Appendix, Section A.3.) Then Z and W
have a joint normal distribution, and Z has a marginal normal distribution by
Theorem 2.4.3. Thus for D of rank q (and X having a nonsingular distribution,
that is, a density) we have proved the following theorem:





Theorem 2.4.4. If X is distributed according to $N(\mu, \Sigma)$, then Z = DX is
distributed according to $N(D\mu, D\Sigma D')$, where D is a $q \times p$ matrix of rank $q \le p$.


The remainder of this section is devoted to the singular or degenerate
normal distribution and the extension of Theorem 2.4.4 to the case of any
matrix D. A singular distribution is a distribution in p-space that is concentrated
on a lower dimensional set; that is, the probability associated with any
set not intersecting the given set is 0. In the case of the singular normal
distribution the mass is concentrated on a given linear set [that is, the
intersection of a number of (p - 1)-dimensional hyperplanes]. Let y be a set
of coordinates in the linear set (the number of coordinates equaling the
dimensionality of the linear set); then the parametric definition of the linear
set can be given as $x = Ay + \lambda$, where A is a $p \times q$ matrix and $\lambda$ is a
p-vector. Suppose that Y is normally distributed in the q-dimensional linear
set; then we say that

(34)    $X = AY + \lambda$

has a singular or degenerate normal distribution in p-space. If $\mathcal{E}Y = \nu$, then
$\mathcal{E}X = A\nu + \lambda = \mu$, say. If $\mathcal{E}(Y - \nu)(Y - \nu)' = T$, then

(35)    $\mathcal{E}(X - \mu)(X - \mu)' = \mathcal{E}A(Y - \nu)(Y - \nu)'A' = ATA' = \Sigma,$

say. It should be noticed that if p > q, then $\Sigma$ is singular and therefore has
no inverse, and thus we cannot write the normal density for X. In fact, X
cannot have a density at all, because the fact that the probability of any set
not intersecting the q-set is 0 would imply that the density is 0 almost
everywhere.
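
A minimal instance of (34) and (35) with p = 2 and q = 1 shows the singularity concretely. In the sketch below the matrix `A`, the vector `lam`, and the moments of Y are hypothetical values chosen for illustration.

```python
from fractions import Fraction as Fr

# Hypothetical example with p = 2, q = 1: X = A Y + lam, Y scalar with variance T.
A = [[Fr(2)], [Fr(-1)]]     # 2 x 1 matrix
lam = [Fr(1), Fr(3)]        # 2-vector
nu, T = Fr(5), Fr(4)        # mean and variance of Y

mu = [A[i][0] * nu + lam[i] for i in range(2)]                         # E X = A nu + lam
sigma = [[A[i][0] * T * A[j][0] for j in range(2)] for i in range(2)]  # A T A', as in (35)
det_sigma = sigma[0][0] * sigma[1][1] - sigma[0][1] * sigma[1][0]

def x_of(y):
    """Value of X = A y + lam for a given value y of Y."""
    return [A[i][0] * y + lam[i] for i in range(2)]

# Every value of X lies on the line x1 + 2 * x2 = lam1 + 2 * lam2, whatever Y is.
on_line = all(x_of(Fr(y))[0] + 2 * x_of(Fr(y))[1] == lam[0] + 2 * lam[1]
              for y in range(-3, 4))
```

The determinant of `sigma` is 0, so $\Sigma$ has no inverse, and all the mass of X lies on a single line in the plane.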

Now, conversely, let us see that if X has mean $\mu$ and covariance matrix $\Sigma$
of rank r, it can be written as (34) (except for 0 probabilities), where X has
an arbitrary distribution, and Y of r ($\le p$) components has a suitable
distribution. If $\Sigma$ is of rank r, there is a $p \times p$ nonsingular matrix B such
that

(36)    $B\Sigma B' = \begin{pmatrix} I & 0 \\ 0 & 0 \end{pmatrix},$

where the identity is of order r. (See Theorem A.4.1 of the Appendix.) The
transformation

(37)    $BX = V = \begin{pmatrix} V^{(1)} \\ V^{(2)} \end{pmatrix}$
defines a random vector V with covariance matrix (36) and a mean vector

(38)    $\mathcal{E}V = B\mu = \nu = \begin{pmatrix} \nu^{(1)} \\ \nu^{(2)} \end{pmatrix},$

say. Since the variances of the elements of $V^{(2)}$ are zero, $V^{(2)} = \nu^{(2)}$ with
probability 1. Now partition

(39)    $B^{-1} = (C \quad D),$

where C consists of r columns. Then (37) is equivalent to

(40)    $X = B^{-1}V = (C \quad D)\begin{pmatrix} V^{(1)} \\ V^{(2)} \end{pmatrix} = CV^{(1)} + DV^{(2)}.$

Thus with probability 1

(41)    $X = CV^{(1)} + D\nu^{(2)},$

which is of the form of (34) with C as A, $V^{(1)}$ as Y, and $D\nu^{(2)}$ as $\lambda$.

Now we give a formal definition of a normal distribution that includes the 
singular distribution. 

Definition 2.4.1. A random vector X of p components with $\mathcal{E}X = \mu$ and
$\mathcal{E}(X - \mu)(X - \mu)' = \Sigma$ is said to be normally distributed [or is said to be
distributed according to $N(\mu, \Sigma)$] if there is a transformation (34), where the
number of rows of A is p and the number of columns is the rank of $\Sigma$, say r, and
Y (of r components) has a nonsingular normal distribution, that is, has a density

(42)    $k e^{-\frac{1}{2}(y - \nu)'T^{-1}(y - \nu)}.$

It is clear that if $\Sigma$ has rank p, then A can be taken to be I and $\lambda$ to be
0; then X = Y and Definition 2.4.1 agrees with Section 2.3. To avoid redundancy
in Definition 2.4.1 we could take T = I and $\nu = 0$.

Theorem 2.4.5. If X is distributed according to $N(\mu, \Sigma)$, then Z = DX is
distributed according to $N(D\mu, D\Sigma D')$.

This theorem includes the cases where X may have a nonsingular or a
singular distribution and D may be nonsingular or of rank less than q. Since
X can be represented by (34), where Y has a nonsingular distribution
$N(\nu, T)$, we can write

(43)    $Z = DAY + D\lambda,$

where DA is $q \times r$. If the rank of DA is r, the theorem is proved. If the rank
is less than r, say s, then the covariance matrix of Z,

(44)    $DATA'D' = E,$

say, is of rank s. By Theorem A.4.1 of the Appendix, there is a nonsingular
matrix

(45)    $F = \begin{pmatrix} F_1 \\ F_2 \end{pmatrix}$

such that

(46)    $FEF' = \begin{pmatrix} F_1EF_1' & F_1EF_2' \\ F_2EF_1' & F_2EF_2' \end{pmatrix}$
        $= \begin{pmatrix} (F_1DA)T(F_1DA)' & (F_1DA)T(F_2DA)' \\ (F_2DA)T(F_1DA)' & (F_2DA)T(F_2DA)' \end{pmatrix}$
        $= \begin{pmatrix} I_s & 0 \\ 0 & 0 \end{pmatrix}.$

Thus $F_1DA$ is of rank s (by the converse of Theorem A.1.1 of the Appendix),
and $F_2DA = 0$ because each diagonal element of $(F_2DA)T(F_2DA)'$ is a
quadratic form in a row of $F_2DA$ with positive definite matrix T. Thus the
covariance matrix of FZ is (46), and

(47)    $FZ = \begin{pmatrix} F_1 \\ F_2 \end{pmatrix} DAY + FD\lambda = \begin{pmatrix} F_1DAY \\ 0 \end{pmatrix} + FD\lambda = \begin{pmatrix} U_1 \\ 0 \end{pmatrix} + FD\lambda,$

say. Clearly $U_1$ has a nonsingular normal distribution. Let $F^{-1} = (G_1 \quad G_2)$.
Then

(48)    $Z = F^{-1}\begin{pmatrix} U_1 \\ 0 \end{pmatrix} + D\lambda = G_1U_1 + D\lambda,$

which is of the form (34). ■


The developments in this section can be illuminated by considering the
geometric interpretation put forward in the previous section. The density of
X is constant on the ellipsoids (54) of Section 2.3. Since the transformation
(2) is a linear transformation (i.e., a change of coordinate axes), the density of
Y is constant on ellipsoids

(49)    $(y - C\mu)'(C\Sigma C')^{-1}(y - C\mu) = k.$

The marginal distribution of $X^{(1)}$ is the projection of the mass of the
distribution of X onto the q-dimensional space of the first q coordinate axes.
The surfaces of constant density are again ellipsoids. The projection of mass
on any line is normal.


2.5. CONDITIONAL DISTRIBUTIONS AND MULTIPLE
CORRELATION COEFFICIENT 


2.5.1. Conditional Distributions 

In this section we find that conditional distributions derived from joint
normal distribution are normal. The conditional distributions are of a particularly
simple nature because the means depend only linearly on the variates
held fixed, and the variances and covariances do not depend at all on the
values of the fixed variates. The theory of partial and multiple correlation
discussed in this section was originally developed by Karl Pearson (1896) for
three variables and extended by Yule (1897a, 1897b).

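Let X be distributed according to $N(\mu, \Sigma)$ (with $\Sigma$ nonsingular). Let us
partition

(1)    $X = \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix}$

as before into q- and (p - q)-component subvectors, respectively. We shall
use the algebra developed in Section 2.4 here. The joint density of
$Y^{(1)} = X^{(1)} - \Sigma_{12}\Sigma_{22}^{-1}X^{(2)}$ and $Y^{(2)} = X^{(2)}$ is

        $n(y^{(1)} \mid \mu^{(1)} - \Sigma_{12}\Sigma_{22}^{-1}\mu^{(2)},\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21})\, n(y^{(2)} \mid \mu^{(2)}, \Sigma_{22}).$

The density of $X^{(1)}$ and $X^{(2)}$ then can be obtained from this expression by
substituting $x^{(1)} - \Sigma_{12}\Sigma_{22}^{-1}x^{(2)}$ for $y^{(1)}$ and $x^{(2)}$ for $y^{(2)}$ (the Jacobian of this
transformation being 1); the resulting density of $X^{(1)}$ and $X^{(2)}$ is

(2)    $f(x^{(1)}, x^{(2)}) = \dfrac{1}{(2\pi)^{\frac{1}{2}q}\sqrt{|\Sigma_{11\cdot2}|}} \exp\left\{ -\tfrac{1}{2}\left[(x^{(1)} - \mu^{(1)}) - \Sigma_{12}\Sigma_{22}^{-1}(x^{(2)} - \mu^{(2)})\right]' \Sigma_{11\cdot2}^{-1} \left[(x^{(1)} - \mu^{(1)}) - \Sigma_{12}\Sigma_{22}^{-1}(x^{(2)} - \mu^{(2)})\right] \right\}$
       $\cdot\ \dfrac{1}{(2\pi)^{\frac{1}{2}(p - q)}\sqrt{|\Sigma_{22}|}} \exp\left[ -\tfrac{1}{2}(x^{(2)} - \mu^{(2)})'\Sigma_{22}^{-1}(x^{(2)} - \mu^{(2)}) \right],$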
where

(3)    $\Sigma_{11\cdot2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$

This density must be $n(x \mid \mu, \Sigma)$. The conditional density of $X^{(1)}$ given that
$X^{(2)} = x^{(2)}$ is the quotient of (2) and the marginal density of $X^{(2)}$ at the point
$x^{(2)}$, which is $n(x^{(2)} \mid \mu^{(2)}, \Sigma_{22})$, the second factor of (2). The quotient is

(4)    $f(x^{(1)} \mid x^{(2)}) = \dfrac{1}{(2\pi)^{\frac{1}{2}q}\sqrt{|\Sigma_{11\cdot2}|}} \exp\left\{ -\tfrac{1}{2}\left[(x^{(1)} - \mu^{(1)}) - \Sigma_{12}\Sigma_{22}^{-1}(x^{(2)} - \mu^{(2)})\right]' \Sigma_{11\cdot2}^{-1} \left[(x^{(1)} - \mu^{(1)}) - \Sigma_{12}\Sigma_{22}^{-1}(x^{(2)} - \mu^{(2)})\right] \right\}.$

It is understood that $x^{(2)}$ consists of p - q numbers. The density $f(x^{(1)} \mid x^{(2)})$
is a q-variate normal density with mean

(5)    $\mathcal{E}(X^{(1)} \mid x^{(2)}) = \mu^{(1)} + \Sigma_{12}\Sigma_{22}^{-1}(x^{(2)} - \mu^{(2)}) = \nu(x^{(2)}),$

say, and covariance matrix

(6)    $\mathcal{E}\{[X^{(1)} - \nu(x^{(2)})][X^{(1)} - \nu(x^{(2)})]' \mid x^{(2)}\} = \Sigma_{11\cdot2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$

It should be noted that the mean of $X^{(1)}$ given $x^{(2)}$ is simply a linear function
of $x^{(2)}$, and the covariance matrix of $X^{(1)}$ given $x^{(2)}$ does not depend on $x^{(2)}$
at all.

Definition 2.5.1. The matrix $\beta = \Sigma_{12}\Sigma_{22}^{-1}$ is the matrix of regression coefficients
of $X^{(1)}$ on $x^{(2)}$.

The element in the ith row and (k - q)th column of $\beta = \Sigma_{12}\Sigma_{22}^{-1}$ is often
denoted by

(7)    $\beta_{ik \cdot q+1, \dots, k-1, k+1, \dots, p}, \qquad i = 1, \dots, q, \quad k = q + 1, \dots, p.$

The vector $\mu^{(1)} + \beta(x^{(2)} - \mu^{(2)})$ is called the regression function.

Let $\sigma_{ij \cdot q+1, \dots, p}$ be the i, jth element of $\Sigma_{11\cdot2}$. We call these partial
covariances; $\sigma_{ii \cdot q+1, \dots, p}$ is a partial variance.

Definition 2.5.2.

(8)    $\rho_{ij \cdot q+1, \dots, p} = \dfrac{\sigma_{ij \cdot q+1, \dots, p}}{\sqrt{\sigma_{ii \cdot q+1, \dots, p}}\, \sqrt{\sigma_{jj \cdot q+1, \dots, p}}}, \qquad i, j = 1, \dots, q,$

is the partial correlation between $X_i$ and $X_j$ holding $X_{q+1}, \dots, X_p$ fixed.
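
As a small worked case of Definitions 2.5.1 and 2.5.2, the sketch below takes p = 3 and q = 2, so that $\Sigma_{22}$ is the scalar variance of $X_3$, and computes the regression coefficients, the partial covariances, and the partial correlation $\rho_{12\cdot3}$. The covariance matrix is hypothetical and the variable names are chosen for this illustration.

```python
import math

# Hypothetical covariance matrix of (X1, X2, X3); here q = 2 and X3 is held fixed.
sigma = [[4.0, 2.0, 1.0],
         [2.0, 9.0, 3.0],
         [1.0, 3.0, 4.0]]

s33 = sigma[2][2]
# Sigma_11.2 = Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21, with Sigma_22 the scalar s33.
partial = [[sigma[i][j] - sigma[i][2] * sigma[2][j] / s33 for j in range(2)]
           for i in range(2)]
beta = [sigma[i][2] / s33 for i in range(2)]      # regression coefficients of X1, X2 on x3

# Partial correlation between X1 and X2 holding X3 fixed, as in (8).
rho_12_3 = partial[0][1] / math.sqrt(partial[0][0] * partial[1][1])
```

For these numbers the partial variances are 3.75 and 6.75 and the partial covariance is 1.25, so the partial correlation is smaller than the ordinary correlation $2/\sqrt{4 \cdot 9} = 1/3$.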
The numbering of the components of X is arbitrary and q is arbitrary.
Hence, the above serves to define the conditional distribution of any q
components of X given any other p - q components. In the case of partial
covariances and correlations the conditioning variables are indicated by the
subscripts after the dot, and in the case of regression coefficients the
dependent variable is indicated by the first subscript, the relevant conditioning
variable by the second subscript, and the other conditioning variables by
the subscripts after the dot. Further, the notation accommodates the conditional
distribution of any q variables conditional on any other r - q variables
($q \le r \le p$).

Theorem 2.5.1. Let the components of X be divided into two groups composing
the subvectors $X^{(1)}$ and $X^{(2)}$. Suppose the mean $\mu$ is similarly divided into
$\mu^{(1)}$ and $\mu^{(2)}$, and suppose the covariance matrix $\Sigma$ of X is divided into
$\Sigma_{11}$, $\Sigma_{12}$, $\Sigma_{22}$, the covariance matrices of $X^{(1)}$, of $X^{(1)}$ and $X^{(2)}$, and of $X^{(2)}$,
respectively. Then if the distribution of X is normal, the conditional distribution of
$X^{(1)}$ given $X^{(2)} = x^{(2)}$ is normal with mean $\mu^{(1)} + \Sigma_{12}\Sigma_{22}^{-1}(x^{(2)} - \mu^{(2)})$ and
covariance matrix $\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$.

As an example of the above considerations let us consider the bivariate
normal distribution and find the conditional distribution of $X_1$ given $X_2 = x_2$.
In this case $\mu^{(1)} = \mu_1$, $\mu^{(2)} = \mu_2$, $\Sigma_{11} = \sigma_1^2$, $\Sigma_{12} = \sigma_1\sigma_2\rho$, and $\Sigma_{22} = \sigma_2^2$. Thus
the $1 \times 1$ matrix of regression coefficients is $\Sigma_{12}\Sigma_{22}^{-1} = \sigma_1\rho/\sigma_2$, and the
$1 \times 1$ matrix of partial covariances is

(9)    $\Sigma_{11\cdot2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} = \sigma_1^2 - \sigma_1^2\sigma_2^2\rho^2/\sigma_2^2 = \sigma_1^2(1 - \rho^2).$

The density of $X_1$ given $x_2$ is $n[x_1 \mid \mu_1 + (\sigma_1\rho/\sigma_2)(x_2 - \mu_2),\ \sigma_1^2(1 - \rho^2)]$.

The mean of this conditional distribution increases with $x_2$ when $\rho$ is
positive and decreases with increasing $x_2$ when $\rho$ is negative. It may be
noted that when $\sigma_1 = \sigma_2$, for example, the mean of the conditional distribution
of $X_1$ does not increase relative to $\mu_1$ as much as $x_2$ increases relative to
$\mu_2$. [Galton (1889) observed that the average heights of sons whose fathers'
heights were above average tended to be less than the fathers' heights; he
called this effect "regression towards mediocrity."] The larger $|\rho|$ is, the
smaller the variance of the conditional distribution, that is, the more information
$x_2$ gives about $x_1$. This is another reason for considering $\rho$ a
measure of association between $X_1$ and $X_2$.

A geometrical interpretation of the theory is enlightening. The density
$f(x_1, x_2)$ can be thought of as a surface $z = f(x_1, x_2)$ over the $x_1,x_2$-plane. If
we intersect this surface with the plane $x_2 = c$, we obtain a curve $z = f(x_1, c)$
over the line $x_2 = c$ in the $x_1,x_2$-plane. The ordinate of this curve is
proportional to the conditional density of $X_1$ given $x_2 = c$; that is, it is
proportional to the ordinate of the curve of a univariate normal distribution.
In the more general case it is convenient to consider the ellipsoids of
constant density in the $p$-dimensional space. Then the surfaces of constant
density of $f(x_1,\dots,x_q \mid c_{q+1},\dots,c_p)$ are the intersections of the surfaces of
constant density of $f(x_1,\dots,x_p)$ and the hyperplanes $x_{q+1} = c_{q+1}, \dots, x_p = c_p$;
these are again ellipsoids.

Further clarification of these ideas may be had by consideration of an 
actual population which is idealized by a normal distribution. Consider, for 
example, a population of father-son pairs. If the population is reasonably 
homogeneous the heights of fathers and the heights of corresponding sons 
have approximately a normal distribution (over a certain range). A conditional
distribution may be obtained by considering sons of all fathers whose
height is, say, 5 feet, 9 inches (to the accuracy of measurement); the heights 
of these sons will have an approximate univariate normal distribution. The 
mean of this normal distribution will differ from the mean of the heights of 
sons whose fathers’ heights are 5 feet, 4 inches, say, but the variances will be 
about the same. 

We could also consider triplets of observations, the height of a father, 
height of the oldest son, and height of the next oldest son. The collection of 
heights of two sons given that the fathers’ heights are 5 feet, 9 inches is a 
conditional distribution of two variables; the correlation between the heights 
of oldest and next oldest sons is a partial correlation coefficient. Holding the
fathers' heights constant eliminates the effect of heredity from fathers;
however, one would expect that the partial correlation coefficient would be
positive, since the effect of mothers' heredity and environmental factors
would tend to cause brothers' heights to vary similarly.

As we have remarked above, any conditional distribution obtained from a
normal distribution is normal with the mean a linear function of the variables
held fixed and the covariance matrix constant. In the case of nonnormal
distributions the conditional distribution of one set of variates on another
does not usually have these properties. However, one can construct nonnormal
distributions such that some conditional distributions have these properties.
This can be done by taking as the density of $X$ the product
$n[x^{(1)} \mid \mu^{(1)} + \beta(x^{(2)} - \mu^{(2)}),\ \Sigma_{11\cdot2}]\, f(x^{(2)})$, where $f(x^{(2)})$ is an arbitrary density.

2.5.2. The Multiple Correlation Coefficient

We again consider $X$ partitioned into $X^{(1)}$ and $X^{(2)}$. We shall study some
properties of $\beta X^{(2)}$.

Definition 2.5.3. The vector $X^{(1\cdot2)} = X^{(1)} - \mu^{(1)} - \beta(X^{(2)} - \mu^{(2)})$ is the vector
of residuals of $X^{(1)}$ from its regression on $X^{(2)}$.

Theorem 2.5.2. The components of $X^{(1\cdot2)}$ are uncorrelated with the components
of $X^{(2)}$.

Proof. The vector $X^{(1\cdot2)}$ is $Y^{(1)} - \mathcal{E}Y^{(1)}$ in (25) of Section 2.4. ■

Let $\sigma'_{(i)}$ be the $i$th row of $\Sigma_{12}$, and $\beta'_{(i)}$ the $i$th row of $\beta$ (i.e.,
$\beta'_{(i)} = \sigma'_{(i)}\Sigma_{22}^{-1}$). Let $\mathcal{V}(Z)$ be the variance of $Z$.

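Theorem 2.5.3. For every vector $\alpha$

(10)  $\mathcal{V}(X_i^{(1\cdot2)}) \le \mathcal{V}(X_i - \alpha'X^{(2)}).$

Proof. By Theorem 2.5.2

(11)  $\mathcal{V}(X_i - \alpha'X^{(2)})$

      $= \mathcal{E}[X_i - \mu_i - \alpha'(X^{(2)} - \mu^{(2)})]^2$

      $= \mathcal{E}[X_i^{(1\cdot2)} - \mathcal{E}X_i^{(1\cdot2)} + (\beta_{(i)} - \alpha)'(X^{(2)} - \mu^{(2)})]^2$

      $= \mathcal{V}(X_i^{(1\cdot2)}) + (\beta_{(i)} - \alpha)'\,\mathcal{E}(X^{(2)} - \mu^{(2)})(X^{(2)} - \mu^{(2)})'\,(\beta_{(i)} - \alpha)$

      $= \mathcal{V}(X_i^{(1\cdot2)}) + (\beta_{(i)} - \alpha)'\Sigma_{22}(\beta_{(i)} - \alpha).$

Since $\Sigma_{22}$ is positive definite, the quadratic form in $\beta_{(i)} - \alpha$ is nonnegative
and attains its minimum of 0 at $\alpha = \beta_{(i)}$. ■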

Since $\mathcal{E}X^{(1\cdot2)} = 0$, $\mathcal{V}(X_i^{(1\cdot2)}) = \mathcal{E}(X_i^{(1\cdot2)})^2$. Thus $\mu_i + \beta'_{(i)}(X^{(2)} - \mu^{(2)})$ is the
best linear predictor of $X_i$ in the sense that of all functions of $X^{(2)}$ of the form
$\alpha'X^{(2)} + c$, the mean squared error of the above is a minimum.

Theorem 2.5.4. For every vector $\alpha$

(12)  $\mathrm{Corr}(X_i, \beta'_{(i)}X^{(2)}) \ge \mathrm{Corr}(X_i, \alpha'X^{(2)}).$

Proof. Since the correlation between two variables is unchanged when
either or both is multiplied by a positive constant, we can assume that
$\mathcal{E}[\alpha'(X^{(2)} - \mu^{(2)})]^2 = \mathcal{E}[\beta'_{(i)}(X^{(2)} - \mu^{(2)})]^2$. Then the expansion of (10) is

(13)  $\sigma_{ii} - 2\,\mathcal{E}(X_i - \mu_i)\beta'_{(i)}(X^{(2)} - \mu^{(2)}) + \mathcal{V}(\beta'_{(i)}X^{(2)})$

      $\le \sigma_{ii} - 2\,\mathcal{E}(X_i - \mu_i)\alpha'(X^{(2)} - \mu^{(2)}) + \mathcal{V}(\alpha'X^{(2)}).$

This leads to

(14)  $\dfrac{\mathcal{E}(X_i - \mu_i)\beta'_{(i)}(X^{(2)} - \mu^{(2)})}{\sqrt{\sigma_{ii}\,\mathcal{V}(\beta'_{(i)}X^{(2)})}} \ge \dfrac{\mathcal{E}(X_i - \mu_i)\alpha'(X^{(2)} - \mu^{(2)})}{\sqrt{\sigma_{ii}\,\mathcal{V}(\alpha'X^{(2)})}}.$ ■


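Definition 2.5.4. The maximum correlation between $X_i$ and the linear
combination $\alpha'X^{(2)}$ is called the multiple correlation coefficient between $X_i$
and $X^{(2)}$.

It follows that this is

(15)  $\bar R_{i\cdot q+1,\dots,p} = \dfrac{\mathcal{E}\,\beta'_{(i)}(X^{(2)} - \mu^{(2)})(X_i - \mu_i)}{\sqrt{\sigma_{ii}}\,\sqrt{\mathcal{E}\,\beta'_{(i)}(X^{(2)} - \mu^{(2)})(X^{(2)} - \mu^{(2)})'\beta_{(i)}}} = \dfrac{\sqrt{\sigma'_{(i)}\Sigma_{22}^{-1}\sigma_{(i)}}}{\sqrt{\sigma_{ii}}}.$

A useful formula is

(16)  $1 - \bar R^2_{i\cdot q+1,\dots,p} = \dfrac{\sigma_{ii} - \sigma'_{(i)}\Sigma_{22}^{-1}\sigma_{(i)}}{\sigma_{ii}} = \dfrac{|\Sigma_i|}{\sigma_{ii}\,|\Sigma_{22}|},$

where Theorem A.3.2 of the Appendix has been applied to

(17)  $\Sigma_i = \begin{pmatrix} \sigma_{ii} & \sigma'_{(i)} \\ \sigma_{(i)} & \Sigma_{22} \end{pmatrix}.$

Since

(18)  $\sigma_{ii\cdot q+1,\dots,p} = \sigma_{ii} - \sigma'_{(i)}\Sigma_{22}^{-1}\sigma_{(i)},$

it follows that

(19)  $1 - \bar R^2_{i\cdot q+1,\dots,p} = \dfrac{\sigma_{ii\cdot q+1,\dots,p}}{\sigma_{ii}}.$
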
This shows incidentally that any partial variance of a component of $X$ cannot
be greater than the variance. In fact, the larger $\bar R_{i\cdot q+1,\dots,p}$ is, the greater the
reduction in variance on going to the conditional distribution. This fact is
another reason for considering the multiple correlation coefficient a measure
of association between $X_i$ and $X^{(2)}$.
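
As a numerical illustration of (15), (18), and (19), the following Python sketch computes the multiple correlation coefficient from $\sigma_{ii}$, $\sigma_{(i)}$, and $\Sigma_{22}$. It is a minimal sketch using only the standard library; the helper names `solve` and `multiple_correlation` and the example covariance values are assumptions made here for illustration, not part of the text.

```python
from math import sqrt


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def multiple_correlation(sigma_ii, sigma_i, sigma_22):
    """R-bar of (15): sqrt(sigma_(i)' Sigma_22^{-1} sigma_(i) / sigma_ii)."""
    beta_i = solve(sigma_22, sigma_i)  # beta_(i) = Sigma_22^{-1} sigma_(i)
    explained = sum(s * b for s, b in zip(sigma_i, beta_i))
    return sqrt(explained / sigma_ii)


# Illustrative values: X_i has unit variance and covariance 0.5 with each
# of two conditioning variables, which have correlation 0.5 with each other.
sigma_ii = 1.0
sigma_i = [0.5, 0.5]
sigma_22 = [[1.0, 0.5], [0.5, 1.0]]

r_bar = multiple_correlation(sigma_ii, sigma_i, sigma_22)
partial_variance = sigma_ii * (1.0 - r_bar ** 2)  # eq. (19) rearranged
print(r_bar ** 2, partial_variance)
```

The partial variance printed at the end is never larger than `sigma_ii`, which is the inequality noted in the preceding paragraph.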

That $\beta'_{(i)}X^{(2)}$ is the best linear predictor of $X_i$ and has the maximum
correlation between $X_i$ and linear functions of $X^{(2)}$ depends only on the
covariance structure, without regard to normality. Even if $X$ does not have a
normal distribution, the regression of $X^{(1)}$ on $X^{(2)}$ can be defined by
$\mu^{(1)} + \Sigma_{12}\Sigma_{22}^{-1}(x^{(2)} - \mu^{(2)})$; the residuals can be defined by Definition 2.5.3;
and partial covariances and correlations can be defined as the covariances
and correlations of residuals yielding (3) and (8). Then these quantities do
not necessarily have interpretations in terms of conditional distributions. In
the case of normality $\mu_i + \beta'_{(i)}(x^{(2)} - \mu^{(2)})$ is the conditional expectation of $X_i$
given $X^{(2)} = x^{(2)}$. Without regard to normality, $X_i - \mathcal{E}(X_i \mid X^{(2)})$ is uncorrelated
with any function of $X^{(2)}$, $\mathcal{E}(X_i \mid X^{(2)})$ minimizes $\mathcal{E}[X_i - h(X^{(2)})]^2$ with respect
to functions $h(X^{(2)})$ of $X^{(2)}$, and $\mathcal{E}(X_i \mid X^{(2)})$ maximizes the correlation between
$X_i$ and functions of $X^{(2)}$. (See Problems 2.48 to 2.51.)


2.5.3. Some Formulas for Partial Correlations 

We now consider relations between several conditional distributions obtained 
by holding several different sets of variates fixed. These relations are useful 
because they enable us to compute one set of conditional parameters from 
another set. A very special case is 


(20)  $\rho_{12\cdot3} = \dfrac{\rho_{12} - \rho_{13}\rho_{23}}{\sqrt{1 - \rho_{13}^2}\,\sqrt{1 - \rho_{23}^2}};$

this follows from (8) when $p = 3$ and $q = 2$. We shall now find a generalization
of this result. The derivation is tedious, but is given here for completeness.

Let

(21)  $X = \begin{pmatrix} X^{(1)} \\ X^{(2)} \\ X^{(3)} \end{pmatrix},$

where $X^{(1)}$ is of $p_1$ components, $X^{(2)}$ of $p_2$ components, and $X^{(3)}$ of $p_3$
components. Suppose we have the conditional distribution of $X^{(1)}$ and $X^{(2)}$
given $X^{(3)} = x^{(3)}$; how do we find the conditional distribution of $X^{(1)}$ given
$X^{(2)} = x^{(2)}$ and $X^{(3)} = x^{(3)}$? We use the fact that the conditional density of $X^{(1)}$
given $X^{(2)} = x^{(2)}$ and $X^{(3)} = x^{(3)}$ is

(22)  $f(x^{(1)} \mid x^{(2)}, x^{(3)}) = \dfrac{f(x^{(1)}, x^{(2)}, x^{(3)})}{f(x^{(2)}, x^{(3)})}$

      $= \dfrac{f(x^{(1)}, x^{(2)}, x^{(3)})/f(x^{(3)})}{f(x^{(2)}, x^{(3)})/f(x^{(3)})}$

      $= \dfrac{f(x^{(1)}, x^{(2)} \mid x^{(3)})}{f(x^{(2)} \mid x^{(3)})}.$


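In the case of normality the conditional covariance matrix of $X^{(1)}$ and $X^{(2)}$
given $X^{(3)} = x^{(3)}$ is

(23)  $\mathcal{C}\left[\begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix} \Big|\; x^{(3)}\right] = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} - \begin{pmatrix} \Sigma_{13} \\ \Sigma_{23} \end{pmatrix} \Sigma_{33}^{-1} \begin{pmatrix} \Sigma_{31} & \Sigma_{32} \end{pmatrix} = \begin{pmatrix} \Sigma_{11\cdot3} & \Sigma_{12\cdot3} \\ \Sigma_{21\cdot3} & \Sigma_{22\cdot3} \end{pmatrix},$

say, where

(24)  $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} & \Sigma_{13} \\ \Sigma_{21} & \Sigma_{22} & \Sigma_{23} \\ \Sigma_{31} & \Sigma_{32} & \Sigma_{33} \end{pmatrix}.$
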


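The conditional covariance of $X^{(1)}$ given $X^{(2)} = x^{(2)}$ and $X^{(3)} = x^{(3)}$ is
calculated from the conditional covariances of $X^{(1)}$ and $X^{(2)}$ given $X^{(3)} = x^{(3)}$ as

(25)  $\mathcal{C}[X^{(1)} \mid x^{(2)}, x^{(3)}] = \Sigma_{11\cdot3} - \Sigma_{12\cdot3}(\Sigma_{22\cdot3})^{-1}\Sigma_{21\cdot3}.$

This result permits the calculation of $\sigma_{ij\cdot p_1+1,\dots,p}$, $i,j = 1,\dots,p_1$, from
$\sigma_{ij\cdot p_1+p_2+1,\dots,p}$, $i,j = 1,\dots,p_1+p_2$.

In particular, for $p_1 = q$, $p_2 = 1$, and $p_3 = p - q - 1$, we obtain

(26)  $\sigma_{ij\cdot q+1,\dots,p} = \sigma_{ij\cdot q+2,\dots,p} - \dfrac{\sigma_{i,q+1\cdot q+2,\dots,p}\,\sigma_{j,q+1\cdot q+2,\dots,p}}{\sigma_{q+1,q+1\cdot q+2,\dots,p}}, \qquad i,j = 1,\dots,q.$

Since

(27)  $\sigma_{ii\cdot q+1,\dots,p} = \sigma_{ii\cdot q+2,\dots,p}\,(1 - \rho^2_{i,q+1\cdot q+2,\dots,p}),$

we obtain

(28)  $\rho_{ij\cdot q+1,\dots,p} = \dfrac{\rho_{ij\cdot q+2,\dots,p} - \rho_{i,q+1\cdot q+2,\dots,p}\,\rho_{j,q+1\cdot q+2,\dots,p}}{\sqrt{1 - \rho^2_{i,q+1\cdot q+2,\dots,p}}\,\sqrt{1 - \rho^2_{j,q+1\cdot q+2,\dots,p}}}.$

This is a useful recursion formula to compute from $\{\rho_{ij}\}$ in succession
$\{\rho_{ij\cdot p}\}$, $\{\rho_{ij\cdot p-1,p}\}$, $\dots$, $\rho_{12\cdot3,\dots,p}$.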
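
The recursion can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the author's procedure: the function name `partial_corr`, the 0-based indexing, and the equicorrelated example matrix are all introduced here. Each call removes one conditioning variable by applying the recursion formula once, bottoming out at the simple correlations.

```python
from math import sqrt


def partial_corr(rho, i, j, given=()):
    """Partial correlation of variables i and j holding `given` fixed.

    `rho` is the matrix of simple correlations (list of lists, 0-based
    indices); `given` is a tuple of indices of the variables held fixed.
    """
    if not given:
        return rho[i][j]
    k, rest = given[0], tuple(given[1:])
    # Correlations that already hold the remaining variables fixed.
    r_ij = partial_corr(rho, i, j, rest)
    r_ik = partial_corr(rho, i, k, rest)
    r_jk = partial_corr(rho, j, k, rest)
    # One step of the recursion: additionally hold variable k fixed.
    return (r_ij - r_ik * r_jk) / sqrt((1.0 - r_ik ** 2) * (1.0 - r_jk ** 2))


# Illustrative example: four variables with all simple correlations 0.5.
rho = [[1.0 if a == b else 0.5 for b in range(4)] for a in range(4)]
print(partial_corr(rho, 0, 1, (2,)))    # one variable held fixed
print(partial_corr(rho, 0, 1, (2, 3)))  # two variables held fixed
```

For this equicorrelated example the two printed values are approximately 1/3 and 1/4. The plain recursion recomputes lower-order coefficients repeatedly; for many conditioning variables one would cache intermediate results.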


2.6. THE CHARACTERISTIC FUNCTION; MOMENTS 
2.6.1. The Characteristic Function 

The characteristic function of a multivariate normal distribution has a form 
similar to the density function. From the characteristic function, moments 
and cumulants can be found easily. 

Definition 2.6.1. The characteristic function of a random vector $X$ is

(1)  $\phi(t) = \mathcal{E}e^{it'X}$

defined for every real vector $t$.

To make this definition meaningful we need to define the expected value
of a complex-valued function of a random vector.

Definition 2.6.2. Let the complex-valued function $g(x)$ be written as
$g(x) = g_1(x) + ig_2(x)$, where $g_1(x)$ and $g_2(x)$ are real-valued. Then the expected
value of $g(X)$ is

(2)  $\mathcal{E}g(X) = \mathcal{E}g_1(X) + i\,\mathcal{E}g_2(X).$

In particular, since $e^{i\theta} = \cos\theta + i\sin\theta$,

(3)  $\mathcal{E}e^{it'X} = \mathcal{E}\cos t'X + i\,\mathcal{E}\sin t'X.$

To evaluate the characteristic function of a vector $X$, it is often convenient
to use the following lemma:

Lemma 2.6.1. Let $X' = (X^{(1)\prime}, X^{(2)\prime})$. If $X^{(1)}$ and $X^{(2)}$ are independent and
$g(x) = g^{(1)}(x^{(1)})\,g^{(2)}(x^{(2)})$, then

(4)  $\mathcal{E}g(X) = \mathcal{E}g^{(1)}(X^{(1)})\,\mathcal{E}g^{(2)}(X^{(2)}).$

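Proof. If $g(x)$ is real-valued and $X$ has a density,

(5)  $\mathcal{E}g(X) = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} g(x)f(x)\,dx_1\cdots dx_p$

     $= \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} g^{(1)}(x^{(1)})\,g^{(2)}(x^{(2)})\,f^{(1)}(x^{(1)})\,f^{(2)}(x^{(2)})\,dx_1\cdots dx_p$

     $= \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} g^{(1)}(x^{(1)})\,f^{(1)}(x^{(1)})\,dx_1\cdots dx_q \cdot \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} g^{(2)}(x^{(2)})\,f^{(2)}(x^{(2)})\,dx_{q+1}\cdots dx_p$

     $= \mathcal{E}g^{(1)}(X^{(1)})\,\mathcal{E}g^{(2)}(X^{(2)}).$

If $g(x)$ is complex-valued,

(6)  $g(x) = [g_1^{(1)}(x^{(1)}) + ig_2^{(1)}(x^{(1)})]\,[g_1^{(2)}(x^{(2)}) + ig_2^{(2)}(x^{(2)})]$

     $= g_1^{(1)}(x^{(1)})\,g_1^{(2)}(x^{(2)}) - g_2^{(1)}(x^{(1)})\,g_2^{(2)}(x^{(2)})$

       $+\; i\,[g_2^{(1)}(x^{(1)})\,g_1^{(2)}(x^{(2)}) + g_1^{(1)}(x^{(1)})\,g_2^{(2)}(x^{(2)})].$

Then

(7)  $\mathcal{E}g(X) = \mathcal{E}[g_1^{(1)}(X^{(1)})\,g_1^{(2)}(X^{(2)}) - g_2^{(1)}(X^{(1)})\,g_2^{(2)}(X^{(2)})]$

       $+\; i\,\mathcal{E}[g_2^{(1)}(X^{(1)})\,g_1^{(2)}(X^{(2)}) + g_1^{(1)}(X^{(1)})\,g_2^{(2)}(X^{(2)})]$

     $= \mathcal{E}g_1^{(1)}(X^{(1)})\,\mathcal{E}g_1^{(2)}(X^{(2)}) - \mathcal{E}g_2^{(1)}(X^{(1)})\,\mathcal{E}g_2^{(2)}(X^{(2)})$

       $+\; i\,[\mathcal{E}g_2^{(1)}(X^{(1)})\,\mathcal{E}g_1^{(2)}(X^{(2)}) + \mathcal{E}g_1^{(1)}(X^{(1)})\,\mathcal{E}g_2^{(2)}(X^{(2)})]$

     $= [\mathcal{E}g_1^{(1)}(X^{(1)}) + i\,\mathcal{E}g_2^{(1)}(X^{(1)})]\,[\mathcal{E}g_1^{(2)}(X^{(2)}) + i\,\mathcal{E}g_2^{(2)}(X^{(2)})]$

     $= \mathcal{E}g^{(1)}(X^{(1)})\,\mathcal{E}g^{(2)}(X^{(2)}).$ ■

By applying Lemma 2.6.1 successively to $g(X) = e^{it'X}$, we derive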


Lemma 2.6.2. If the components of $X$ are mutually independent,

(8)  $\mathcal{E}e^{it'X} = \prod_{j=1}^{p} \mathcal{E}e^{it_jX_j}.$

We now find the characteristic function of a random vector with a normal 
distribution. 
Theorem 2.6.1. The characteristic function of $X$ distributed according to
$N(\mu, \Sigma)$ is

(9)  $\phi(t) = \mathcal{E}e^{it'X} = e^{it'\mu - \frac{1}{2}t'\Sigma t}$

for every real vector $t$.

Proof. From Corollary A.1.6 of the Appendix we know there is a nonsingular
matrix $C$ such that

(10)  $C'\Sigma^{-1}C = I.$

Thus

(11)  $\Sigma^{-1} = C'^{-1}C^{-1} = (CC')^{-1}.$

Let

(12)  $X - \mu = CY.$

Then $Y$ is distributed according to $N(0, I)$.

Now the characteristic function of $Y$ is

(13)  $\psi(u) = \mathcal{E}e^{iu'Y} = \prod_{j=1}^{p} \mathcal{E}e^{iu_jY_j}.$

Since $Y_j$ is distributed according to $N(0, 1)$,

(14)  $\psi(u) = \prod_{j=1}^{p} e^{-\frac{1}{2}u_j^2} = e^{-\frac{1}{2}u'u}.$

Thus

(15)  $\phi(t) = \mathcal{E}e^{it'X} = \mathcal{E}e^{it'(CY + \mu)}$

      $= e^{it'\mu}\,\mathcal{E}e^{it'CY}$

      $= e^{it'\mu}\,e^{-\frac{1}{2}(t'C)(t'C)'}$

for $t'C = u'$; the third equality is verified by writing both sides of it as
integrals. But this is

(16)  $\phi(t) = e^{it'\mu}\,e^{-\frac{1}{2}t'CC't} = e^{it'\mu - \frac{1}{2}t'\Sigma t}$

by (11). This proves the theorem. ■

The characteristic function of the normal distribution is very useful. For
example, we can use this method of proof to demonstrate the results of
Section 2.4. If $Z = DX$, then the characteristic function of $Z$ is

(17)  $\mathcal{E}e^{it'Z} = \mathcal{E}e^{it'DX} = \mathcal{E}e^{i(D't)'X} = e^{i(D't)'\mu - \frac{1}{2}(D't)'\Sigma(D't)} = e^{it'(D\mu) - \frac{1}{2}t'(D\Sigma D')t},$

which is the characteristic function of $N(D\mu, D\Sigma D')$ (by Theorem 2.6.1).
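
The identity behind this argument can be checked numerically. The sketch below is an illustration only: the helper names (`normal_cf`, `quad_form`, `mat_vec`, `mat_mul`, `transpose`) and the numerical values of $\mu$, $\Sigma$, $D$, and $t$ are assumptions chosen here. It evaluates the normal characteristic function for $X$ at $D't$ and compares it with the characteristic function of $N(D\mu, D\Sigma D')$ at $t$.

```python
import cmath


def quad_form(t, s):
    """t' S t for a vector t and a square matrix S (lists of floats)."""
    n = len(t)
    return sum(t[a] * s[a][b] * t[b] for a in range(n) for b in range(n))


def normal_cf(t, mu, sigma):
    """exp(i t'mu - t' Sigma t / 2), the characteristic function of N(mu, Sigma)."""
    linear = sum(ta * ma for ta, ma in zip(t, mu))
    return cmath.exp(1j * linear - 0.5 * quad_form(t, sigma))


def mat_vec(d, x):
    return [sum(row[k] * x[k] for k in range(len(x))) for row in d]


def transpose(d):
    return [list(col) for col in zip(*d)]


def mat_mul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]


# Illustrative parameter values (assumptions, not from the text).
mu = [1.0, -2.0, 0.5]
sigma = [[2.0, 0.3, 0.1], [0.3, 1.0, 0.2], [0.1, 0.2, 1.5]]
d = [[1.0, 1.0, 0.0], [0.0, 1.0, -1.0]]  # a 2 x 3 matrix D
t = [0.7, -0.4]

# Characteristic function of X evaluated at D't ...
lhs = normal_cf(mat_vec(transpose(d), t), mu, sigma)
# ... equals the characteristic function of N(D mu, D Sigma D') at t.
rhs = normal_cf(t, mat_vec(d, mu), mat_mul(mat_mul(d, sigma), transpose(d)))
print(abs(lhs - rhs))
```

The printed difference is zero up to floating-point rounding.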

It is interesting to use the characteristic function to show that it is only the
multivariate normal distribution that has the property that every linear
combination of variates is normally distributed. Consider a vector $Y$ of $p$
components with density $f(y)$ and characteristic function

(18)  $\psi(u) = \mathcal{E}e^{iu'Y} = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} e^{iu'y} f(y)\,dy_1\cdots dy_p,$

and suppose the mean of $Y$ is $\mu$ and the covariance matrix is $\Sigma$. Suppose $u'Y$
is normally distributed for every $u$. Then the characteristic function of such
linear combination is

(19)  $\mathcal{E}e^{itu'Y} = e^{itu'\mu - \frac{1}{2}t^2u'\Sigma u}.$

Now set $t = 1$. Since the right-hand side is then the characteristic function of
$N(\mu, \Sigma)$, the result is proved (by Theorem 2.6.1 above and 2.6.3 below).

Theorem 2.6.2. If every linear combination of the components of a vector Y 
is normally distributed, then Y is normally distributed. 

It might be pointed out in passing that it is essential that every linear
combination be normally distributed for Theorem 2.6.2 to hold. For instance,
if $Y = (Y_1, Y_2)'$ and $Y_1$ and $Y_2$ are not independent, then $Y_1$ and $Y_2$ can each
have a marginal normal distribution. An example is most easily given
geometrically. Let $X_1$, $X_2$ have a joint normal distribution with means 0. Move the
same mass in Figure 2.1 from rectangle A to C and from B to D. It will be
seen that the resulting distribution of $Y$ is such that the marginal
distributions of $Y_1$ and $Y_2$ are the same as $X_1$ and $X_2$, respectively, which are
normal, and yet the joint distribution of $Y_1$ and $Y_2$ is not normal.

This example can be used also to demonstrate that two variables, $Y_1$ and
$Y_2$, can be uncorrelated and the marginal distribution of each may be normal,
but the pair need not have a joint normal distribution and need not be
independent. This is done by choosing the rectangles so that for the resultant
distribution the expected value of $Y_1Y_2$ is zero. It is clear geometrically that
this can be done.

[Figure 2.1. Rectangles A, B, C, and D in the plane.]

For future reference we state two useful theorems concerning characteristic
functions.

Theorem 2.6.3. If the random vector $X$ has the density $f(x)$ and the
characteristic function $\phi(t)$, then

(20)  $f(x) = \dfrac{1}{(2\pi)^p}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} e^{-it'x}\phi(t)\,dt_1\cdots dt_p.$

This shows that the characteristic function determines the density function
uniquely. If $X$ does not have a density, the characteristic function uniquely
defines the probability of any continuity interval. In the univariate case a
continuity interval is an interval such that the cdf does not have a
discontinuity at an endpoint of the interval.

Theorem 2.6.4. Let $\{F_j(x)\}$ be a sequence of cdfs, and let $\{\phi_j(t)\}$ be the
sequence of corresponding characteristic functions. A necessary and sufficient
condition for $F_j(x)$ to converge to a cdf $F(x)$ is that, for every $t$, $\phi_j(t)$ converges
to a limit $\phi(t)$ that is continuous at $t = 0$. When this condition is satisfied, the
limit $\phi(t)$ is identical with the characteristic function of the limiting
distribution $F(x)$.

For the proofs of these two theorems, the reader is referred to Cramér
(1946), Sections 10.6 and 10.7.






2.6.2. The Moments and Cumulants 

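The moments of $X_1,\dots,X_p$ with a joint normal distribution can be obtained
from the characteristic function (9). The mean is

(21)  $\mathcal{E}X_h = \dfrac{1}{i}\dfrac{\partial\phi}{\partial t_h}\Big|_{t=0}$

      $= \dfrac{1}{i}\Big\{-\sum_j \sigma_{hj}t_j + i\mu_h\Big\}\phi(t)\Big|_{t=0}$

      $= \mu_h.$

The second moment is

(22)  $\mathcal{E}X_hX_j = \dfrac{1}{i^2}\dfrac{\partial^2\phi}{\partial t_h\,\partial t_j}\Big|_{t=0}$

      $= \dfrac{1}{i^2}\Big\{\Big(-\sum_k \sigma_{hk}t_k + i\mu_h\Big)\Big(-\sum_k \sigma_{kj}t_k + i\mu_j\Big) - \sigma_{hj}\Big\}\phi(t)\Big|_{t=0}$

      $= \sigma_{hj} + \mu_h\mu_j.$

Thus

(23)  $\mathrm{Variance}(X_i) = \mathcal{E}(X_i - \mu_i)^2 = \sigma_{ii},$

(24)  $\mathrm{Covariance}(X_i, X_j) = \mathcal{E}(X_i - \mu_i)(X_j - \mu_j) = \sigma_{ij}.$

Any third moment about the mean is

(25)  $\mathcal{E}(X_i - \mu_i)(X_j - \mu_j)(X_k - \mu_k) = 0.$

The fourth moment about the mean is

(26)  $\mathcal{E}(X_i - \mu_i)(X_j - \mu_j)(X_k - \mu_k)(X_l - \mu_l) = \sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}.$

Every moment of odd order is 0.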
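
The fourth-moment formula is simple to evaluate from a covariance matrix. The short Python sketch below is illustrative; the function name `fourth_central_moment`, the 0-based indices, and the example covariance matrix are assumptions introduced here. It sums the three pairings of the four indices.

```python
def fourth_central_moment(sigma, i, j, k, l):
    """Fourth moment about the mean of a normal vector with covariance `sigma`.

    Sums the three ways of pairing the indices i, j, k, l (0-based).
    """
    return (sigma[i][j] * sigma[k][l]
            + sigma[i][k] * sigma[j][l]
            + sigma[i][l] * sigma[j][k])


# Illustrative covariance matrix (an assumption, not from the text).
sigma = [[2.0, 0.5], [0.5, 1.0]]

print(fourth_central_moment(sigma, 0, 0, 0, 0))  # 3 * sigma_11^2
print(fourth_central_moment(sigma, 0, 0, 1, 1))  # sigma_11 sigma_22 + 2 sigma_12^2
```

With all four indices equal the formula reduces to the familiar univariate value of three times the squared variance.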

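Definition 2.6.3. If all the moments of a distribution exist, then the
cumulants are the coefficients $\kappa$ in

(27)  $\log\phi(t) = \sum_{s_1,\dots,s_p=0}^{\infty} \kappa_{s_1\cdots s_p}\,\dfrac{(it_1)^{s_1}\cdots(it_p)^{s_p}}{s_1!\cdots s_p!}.$

In the case of the multivariate normal distribution $\kappa_{10\cdots0} = \mu_1, \dots, \kappa_{0\cdots01} = \mu_p$,
$\kappa_{20\cdots0} = \sigma_{11}, \dots, \kappa_{0\cdots02} = \sigma_{pp}$, $\kappa_{110\cdots0} = \sigma_{12}, \dots$. The cumulants for
which $\sum_i s_i > 2$ are 0.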

2.7. ELLIPTICALLY CONTOURED DISTRIBUTIONS

2.7.1. Spherically and Elliptically Contoured Distributions

It was noted at the end of Section 2.3 that the density of the multivariate
normal distribution with mean $\mu$ and covariance matrix $\Sigma$ is constant on
concentric ellipsoids

(1)  $(x - \mu)'\Sigma^{-1}(x - \mu) = k.$

A general class of distributions with this property is the class of elliptically
contoured distributions with density

(2)  $|\Lambda|^{-\frac{1}{2}}\,g[(x - \nu)'\Lambda^{-1}(x - \nu)],$

where $\Lambda$ is a positive definite matrix, $g(\cdot) \ge 0$, and

(3)  $\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} g(y'y)\,dy_1\cdots dy_p = 1.$

If $C$ is a nonsingular matrix such that $C'\Lambda^{-1}C = I$, the transformation
$x - \nu = Cy$ carries the density (2) to the density $g(y'y)$. The contours of
constant density of $g(y'y)$ are spheres centered at the origin. The class of
such densities is known as the spherically contoured distributions. Elliptically
contoured distributions do not necessarily have densities, but in this
exposition only distributions with densities will be treated for statistical
inference.

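A spherically contoured density can be expressed in polar coordinates by
the transformation

(4)  $y_1 = r\sin\theta_1,$
     $y_2 = r\cos\theta_1\sin\theta_2,$
     $y_3 = r\cos\theta_1\cos\theta_2\sin\theta_3,$
     $\;\vdots$
     $y_{p-1} = r\cos\theta_1\cos\theta_2\cdots\cos\theta_{p-2}\sin\theta_{p-1},$
     $y_p = r\cos\theta_1\cos\theta_2\cdots\cos\theta_{p-2}\cos\theta_{p-1},$

where $-\frac{1}{2}\pi < \theta_i \le \frac{1}{2}\pi$, $i = 1,\dots,p-2$, $-\pi < \theta_{p-1} \le \pi$, and $0 \le r < \infty$.
Note that $y'y = r^2$. The Jacobian of the transformation (4) is
$r^{p-1}\cos^{p-2}\theta_1\cos^{p-3}\theta_2\cdots\cos\theta_{p-2}$. See Problem 7.1. If $g(y'y)$ is the density
of $Y$, then the density of $R, \Theta_1,\dots,\Theta_{p-1}$ is

(5)  $r^{p-1}\cos^{p-2}\theta_1\cos^{p-3}\theta_2\cdots\cos\theta_{p-2}\;g(r^2).$

Note that $R, \Theta_1,\dots,\Theta_{p-1}$ are independently distributed. Since

(6)  $\int_{-\pi/2}^{\pi/2}\cos^{h-1}\theta\,d\theta = \dfrac{\Gamma(\frac{1}{2}h)\,\Gamma(\frac{1}{2})}{\Gamma[\frac{1}{2}(h+1)]}$

(Problem 7.2), the marginal density of $R$ is

(7)  $C(p)\,r^{p-1}g(r^2),$

where

(8)  $C(p) = \int_{-\pi}^{\pi}\int_{-\pi/2}^{\pi/2}\cdots\int_{-\pi/2}^{\pi/2}\cos^{p-2}\theta_1\cos^{p-3}\theta_2\cdots\cos\theta_{p-2}\,d\theta_1\cdots d\theta_{p-2}\,d\theta_{p-1} = \dfrac{2\pi^{\frac{1}{2}p}}{\Gamma(\frac{1}{2}p)}.$

The marginal density of $\Theta_i$ is $\Gamma[\frac{1}{2}(p-i+1)]\cos^{p-i-1}\theta_i/\{\Gamma(\frac{1}{2})\,\Gamma[\frac{1}{2}(p-i)]\}$,
$i = 1,\dots,p-2$, and of $\Theta_{p-1}$ is $1/(2\pi)$.

In the normal case of $N(0, I)$ the density of $Y$ is

     $g(y'y) = (2\pi)^{-\frac{1}{2}p}\exp(-\frac{1}{2}y'y),$

and the density of $R = (Y'Y)^{\frac{1}{2}}$ is $r^{p-1}\exp(-\frac{1}{2}r^2)/[2^{\frac{1}{2}p-1}\Gamma(\frac{1}{2}p)]$. The density
of $r^2 = v$ is $v^{\frac{1}{2}p-1}e^{-\frac{1}{2}v}/[2^{\frac{1}{2}p}\Gamma(\frac{1}{2}p)]$. This is the $\chi^2$-density with $p$ degrees of
freedom.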
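
The constant $C(p) = 2\pi^{p/2}/\Gamma(p/2)$ and the radial density in the normal case can be verified numerically. The Python sketch below is an illustration using only the standard library; the function names `sphere_area`, `radial_density_normal`, and `integrate`, the choice $p = 5$, and the truncation of the integral at $r = 12$ are assumptions made here. It checks that the radial density integrates to 1 and that $\mathcal{E}R^2 = p$, as expected for a $\chi^2$ variable with $p$ degrees of freedom.

```python
from math import exp, gamma, pi


def sphere_area(p):
    """C(p) = 2 pi^(p/2) / Gamma(p/2)."""
    return 2.0 * pi ** (0.5 * p) / gamma(0.5 * p)


def radial_density_normal(r, p):
    """Density C(p) r^(p-1) g(r^2) of R when g is the N(0, I) density."""
    g = (2.0 * pi) ** (-0.5 * p) * exp(-0.5 * r * r)
    return sphere_area(p) * r ** (p - 1) * g


def integrate(f, a, b, n=20000):
    """Composite trapezoidal rule on [a, b] with n subintervals."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b)) + sum(f(a + k * h) for k in range(1, n))
    return total * h


p = 5
# The upper limit 12 is an assumed truncation point; the tail beyond it is negligible.
total_mass = integrate(lambda r: radial_density_normal(r, p), 0.0, 12.0)
mean_r2 = integrate(lambda r: r * r * radial_density_normal(r, p), 0.0, 12.0)
print(sphere_area(2), sphere_area(3), total_mass, mean_r2)
```

For $p = 2$ and $p = 3$ the function `sphere_area` returns the circumference $2\pi$ of the unit circle and the surface area $4\pi$ of the unit sphere.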

The constant $C(p)$ is the surface area of a sphere of unit radius in $p$
dimensions. The random vector $U$ with coordinates $\sin\Theta_1, \cos\Theta_1\sin\Theta_2, \dots,$
$\cos\Theta_1\cos\Theta_2\cdots\cos\Theta_{p-1}$, where $\Theta_1,\dots,\Theta_{p-1}$ are independently distributed
with the marginal densities given above ($\Theta_i$ over $(-\pi/2, \pi/2)$ for $i \le p-2$,
and $\Theta_{p-1}$ having the uniform distribution over $(-\pi, \pi)$), is said to be
uniformly distributed on the unit sphere. (This is the simplest example of a
spherically contoured distribution not having a density.) A stochastic
representation of $Y$ with the density $g(y'y)$ is

(9)  $Y \stackrel{d}{=} RU,$

where $R$ has the density (7).

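Since each of the densities of $\Theta_1,\dots,\Theta_{p-1}$ is even,

(10)  $\mathcal{E}U = 0.$

Because $R$ and $U$ are independent,

(11)  $\mathcal{E}Y = 0$

if $\mathcal{E}R < \infty$. Further,

(12)  $\mathcal{E}YY' = \mathcal{E}R^2\,\mathcal{E}UU'$

if $\mathcal{E}R^2 < \infty$. By symmetry $\mathcal{E}U_1^2 = \cdots = \mathcal{E}U_p^2 = 1/p$ because $\sum_{i=1}^{p}U_i^2 = 1$.
Again by symmetry $\mathcal{E}U_1U_2 = \mathcal{E}U_1U_3 = \cdots = \mathcal{E}U_{p-1}U_p$. In particular
$\mathcal{E}U_1U_2 = \mathcal{E}\sin\Theta_1\cos\Theta_1\sin\Theta_2$, the integrand of which is an odd function of $\theta_1$
and of $\theta_2$. Hence, $\mathcal{E}U_iU_j = 0$, $i \ne j$. To summarize,

(13)  $\mathcal{E}UU' = (1/p)I_p$

and

(14)  $\mathcal{E}YY' = (1/p)\,\mathcal{E}R^2\,I_p$

(if $\mathcal{E}R^2 < \infty$).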
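
The moment result $\mathcal{E}UU' = (1/p)I_p$ can be illustrated by simulation. The sketch below is a hedged illustration, not part of the text: it relies on the standard fact that $Y/\|Y\|$ is uniformly distributed on the unit sphere when $Y \sim N(0, I_p)$ (the normal distribution being spherically contoured), and the function names, the seed, and the sample size are assumptions chosen here.

```python
import random
from math import sqrt


def uniform_on_sphere(p, rng):
    """Draw U = Y / ||Y|| with Y ~ N(0, I_p)."""
    y = [rng.gauss(0.0, 1.0) for _ in range(p)]
    norm = sqrt(sum(v * v for v in y))
    return [v / norm for v in y]


def second_moment_matrix(p, n, seed=0):
    """Monte Carlo estimate of E UU' from n draws (seed fixed for repeatability)."""
    rng = random.Random(seed)
    acc = [[0.0] * p for _ in range(p)]
    for _ in range(n):
        u = uniform_on_sphere(p, rng)
        for a in range(p):
            for b in range(p):
                acc[a][b] += u[a] * u[b]
    return [[v / n for v in row] for row in acc]


p = 3
estimate = second_moment_matrix(p, 20000)
for row in estimate:
    print([round(v, 3) for v in row])
```

The diagonal entries of the estimate are close to $1/p$ and the off-diagonal entries close to 0, up to Monte Carlo error; the trace equals 1 exactly (apart from rounding) because $U'U = 1$ for every draw.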

The distinguishing characteristic of the class of spherically contoured
distributions is that $OY \stackrel{d}{=} Y$ for every orthogonal matrix $O$.

Theorem 2.7.1. If $Y$ has the density $g(y'y)$, then $Z = OY$, where $O'O = I$,
has the density $g(z'z)$.

Proof. The transformation $z = Oy$ has Jacobian 1. ■

We shall extend the definition of $Y$ being spherically contoured to any
distribution with the property $OY \stackrel{d}{=} Y$.

Corollary 2.7.1. If $Y$ is spherically contoured with stochastic representation
$Y \stackrel{d}{=} RU$ with $R^2 = Y'Y$, then $U$ is spherically contoured.

Proof. If $Z = OY$ and hence $Z \stackrel{d}{=} Y$, and $Z$ has the stochastic representation
$Z = SV$, where $S^2 = Z'Z$, then $S = R$ and $V = OU \stackrel{d}{=} U$. ■

The density of $X = \nu + CY$ is (2). From (11) and (14) we derive the
following theorem:

Theorem 2.7.2. If $X$ has the density (2) and $\mathcal{E}R^2 < \infty$,

(15)  $\mathcal{E}X = \mu = \nu, \qquad \mathcal{C}(X) = \Sigma = (1/p)\,\mathcal{E}R^2\,\Lambda.$

In fact if $\mathcal{E}R^m < \infty$, a moment of $X$ of order $h$ ($\le m$) is
$\mathcal{E}(X_1 - \mu_1)^{h_1}\cdots(X_p - \mu_p)^{h_p} = \mathcal{E}Z_1^{h_1}\cdots Z_p^{h_p}\;\mathcal{E}R^h/\mathcal{E}(\chi_p^2)^{\frac{1}{2}h}$, where $Z$ has the distribution
$N(0, \Lambda)$ and $h = h_1 + \cdots + h_p$.

Theorem 2.7.3. If $X$ has the density (2), $\mathcal{E}R^2 < \infty$, and $f[c\,\mathcal{C}(X)] = f[\mathcal{C}(X)]$
for all $c > 0$, then $f[\mathcal{C}(X)] = f(\Lambda)$.

In particular $\rho_{ij}(X) = \sigma_{ij}/\sqrt{\sigma_{ii}\sigma_{jj}} = \lambda_{ij}/\sqrt{\lambda_{ii}\lambda_{jj}}$, where $\Sigma = (\sigma_{ij})$ and
$\Lambda = (\lambda_{ij})$.

2.7.2. Distributions of Linear Combinations; Marginal Distributions 

First we consider a spherically contoured distribution with density $g(y'y)$.
Let $y' = (y_1', y_2')$, where $y_1$ and $y_2$ have $q$ and $p - q$ components,
respectively. The marginal density of $y_2$ is

(16)  $g_2(y_2'y_2) = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} g(y_1'y_1 + y_2'y_2)\,dy_1\cdots dy_q.$

Express $y_1$ in polar coordinates (4) with $r$ replaced by $r_1$ and $p$ replaced by
$q$. Then the marginal density of $y_2$ is

(17)  $g_2(y_2'y_2) = C(q)\int_0^{\infty} g(r_1^2 + y_2'y_2)\,r_1^{q-1}\,dr_1.$

This expression shows that the marginal distribution of $y_2$ has a density
which is spherically contoured.

Now consider a vector $X' = (X^{(1)\prime}, X^{(2)\prime})$ with density (2). If $\mathcal{E}R^2 < \infty$,
the covariance matrix of $X$ is (15) partitioned as (14) of Section 2.4.
Let $Z^{(1)} = X^{(1)} - \Sigma_{12}\Sigma_{22}^{-1}X^{(2)} = X^{(1)} - \Lambda_{12}\Lambda_{22}^{-1}X^{(2)}$, $Z^{(2)} = X^{(2)}$,
$\tau^{(1)} = \nu^{(1)} - \Sigma_{12}\Sigma_{22}^{-1}\nu^{(2)} = \nu^{(1)} - \Lambda_{12}\Lambda_{22}^{-1}\nu^{(2)}$, $\tau^{(2)} = \nu^{(2)}$. Then the density of
$Z' = (Z^{(1)\prime}, Z^{(2)\prime})$ is

(18)  $|\Lambda_{11\cdot2}|^{-\frac{1}{2}}|\Lambda_{22}|^{-\frac{1}{2}}\,g[(z^{(1)} - \tau^{(1)})'\Lambda_{11\cdot2}^{-1}(z^{(1)} - \tau^{(1)}) + (z^{(2)} - \nu^{(2)})'\Lambda_{22}^{-1}(z^{(2)} - \nu^{(2)})].$

Note that $Z^{(1)}$ and $Z^{(2)}$ are uncorrelated even though possibly dependent.
Let $C_1$ and $C_2$ be $q \times q$ and $(p-q) \times (p-q)$ matrices satisfying
$C_1'\Lambda_{11\cdot2}^{-1}C_1 = I_q$ and $C_2'\Lambda_{22}^{-1}C_2 = I_{p-q}$. Define $y^{(1)}$ and $y^{(2)}$ by
$z^{(1)} - \tau^{(1)} = C_1y^{(1)}$ and $z^{(2)} - \nu^{(2)} = C_2y^{(2)}$. Then $Y^{(1)}$ and $Y^{(2)}$ have the density
$g(y^{(1)\prime}y^{(1)} + y^{(2)\prime}y^{(2)})$. The marginal density of $Y^{(2)}$ is (17), and the marginal
density of $X^{(2)} = Z^{(2)}$ is

(19)  $|\Lambda_{22}|^{-\frac{1}{2}}\,g_2[(x^{(2)} - \nu^{(2)})'\Lambda_{22}^{-1}(x^{(2)} - \nu^{(2)})]$

      $= |\Lambda_{22}|^{-\frac{1}{2}}\,C(q)\int_0^{\infty} g[r_1^2 + (x^{(2)} - \nu^{(2)})'\Lambda_{22}^{-1}(x^{(2)} - \nu^{(2)})]\,r_1^{q-1}\,dr_1.$

The moments of $Y_2$ can be calculated from the moments of $Y$.

The generalization of Theorem 2.4.1 to elliptically contoured distributions
is the following: Let $X$ with $p$ components have the density (2). Then $Y = CX$
has the density $|C\Lambda C'|^{-\frac{1}{2}}\,g[(y - C\nu)'(C\Lambda C')^{-1}(y - C\nu)]$ for $C$ nonsingular.

The generalization of Theorem 2.4.4 is the following: If $X$ has the density
(2), then $Z = DX$ has the density

(20)  $|D\Lambda D'|^{-\frac{1}{2}}\,g_2[(z - D\nu)'(D\Lambda D')^{-1}(z - D\nu)],$

where $D$ is a $q \times p$ matrix of rank $q \le p$ and $g_2$ is given by (17).

We can also characterize marginal distributions in terms of the
representation (9). Consider

(21)  $\begin{pmatrix} Y^{(1)} \\ Y^{(2)} \end{pmatrix} \stackrel{d}{=} R\begin{pmatrix} U^{(1)} \\ U^{(2)} \end{pmatrix},$

where $Y^{(1)}$ and $U^{(1)}$ have $q$ components and $Y^{(2)}$ and $U^{(2)}$ have $p - q$
components. Then $R_2^2 = Y^{(2)\prime}Y^{(2)}$ has the distribution of $R^2U^{(2)\prime}U^{(2)}$, and

(22)  $U^{(2)\prime}U^{(2)} = \dfrac{U^{(2)\prime}U^{(2)}}{U'U} \stackrel{d}{=} \dfrac{Y^{(2)\prime}Y^{(2)}}{Y'Y}.$

In the case $Y \sim N(0, I_p)$, (22) has the beta distribution, say $B(p - q, q)$, with
density

(23)  $\dfrac{\Gamma(p/2)}{\Gamma(q/2)\,\Gamma[(p-q)/2]}\;z^{\frac{1}{2}(p-q)-1}(1 - z)^{\frac{1}{2}q-1}, \qquad 0 < z < 1.$

Hence, in general,

(24)  $Y^{(2)} \stackrel{d}{=} R_2V,$

where $R_2^2 \stackrel{d}{=} R^2b$, $b \sim B(p - q, q)$, $V$ has the uniform distribution on $v'v = 1$
in $p - q$ dimensions, and $R^2$, $b$, and $V$ are independent. All marginal
distributions are elliptically contoured.

2.7.3. Conditional Distributions and Multiple Correlation Coefficient

The density of the conditional distribution of $Y_1$ given $Y_2 = y_2$ when
$Y = (Y_1', Y_2')'$ has the spherical density $g(y'y)$ is

(25)  $\dfrac{g(y_1'y_1 + y_2'y_2)}{g_2(y_2'y_2)} = \dfrac{g(y_1'y_1 + r_2^2)}{g_2(r_2^2)},$

where the marginal density $g_2(y_2'y_2)$ is given by (17) and $r_2^2 = y_2'y_2$. In terms
of $y_1$, (25) is a spherically contoured distribution (depending on $r_2^2$).

Now consider $X = (X^{(1)\prime}, X^{(2)\prime})'$ with density (2). The conditional density of
$X^{(1)}$ given $X^{(2)} = x^{(2)}$ is

(26)  $|\Lambda_{11\cdot2}|^{-\frac{1}{2}}\,g\{[x^{(1)} - \nu^{(1)} - B(x^{(2)} - \nu^{(2)})]'\Lambda_{11\cdot2}^{-1}[x^{(1)} - \nu^{(1)} - B(x^{(2)} - \nu^{(2)})]$

        $+\;(x^{(2)} - \nu^{(2)})'\Lambda_{22}^{-1}(x^{(2)} - \nu^{(2)})\}$

        $\div\; g_2[(x^{(2)} - \nu^{(2)})'\Lambda_{22}^{-1}(x^{(2)} - \nu^{(2)})]$

      $= |\Lambda_{11\cdot2}|^{-\frac{1}{2}}\,g\{[x^{(1)} - \nu^{(1)} - B(x^{(2)} - \nu^{(2)})]'\Lambda_{11\cdot2}^{-1}[x^{(1)} - \nu^{(1)} - B(x^{(2)} - \nu^{(2)})] + r_2^2\}$

        $\div\; g_2(r_2^2),$

where $r_2^2 = (x^{(2)} - \nu^{(2)})'\Lambda_{22}^{-1}(x^{(2)} - \nu^{(2)})$ and $B = \Lambda_{12}\Lambda_{22}^{-1}$. The density (26) is
elliptically contoured in $x^{(1)} - \nu^{(1)} - B(x^{(2)} - \nu^{(2)})$ as a function of $x^{(1)}$. The
conditional mean of $X^{(1)}$ given $X^{(2)} = x^{(2)}$ is

(27)  $\mathcal{E}(X^{(1)} \mid x^{(2)}) = \nu^{(1)} + B(x^{(2)} - \nu^{(2)})$

if $\mathcal{E}(R_1^2 \mid Y_2'Y_2 = r_2^2) < \infty$ in (25), where $R_1^2 = Y_1'Y_1$. Also the conditional
covariance matrix is $[\mathcal{E}(R_1^2 \mid Y_2'Y_2 = r_2^2)/q]\,\Lambda_{11\cdot2}$. It follows that Definition 2.5.2
of the partial correlation coefficient holds when
$(\sigma_{ij\cdot q+1,\dots,p}) = \Sigma_{11\cdot2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$
and $\Sigma$ is the parameter matrix given above.

Theorems 2.5.2, 2.5.3, and 2.5.4 are true for any elliptically contoured
distribution for which $\mathcal{E}R^2 < \infty$.


2.7.4. The Characteristic Function; Moments

The characteristic function of a random vector $Y$ with a spherically
contoured distribution $\mathcal{E}e^{it'Y}$ has the property of invariance over orthogonal
transformations, that is,

(28)  $\mathcal{E}e^{it'OY} = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} e^{it'Oy}g(y'y)\,dy_1\cdots dy_p$

      $= \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} e^{it'z}g(z'z)\,dz_1\cdots dz_p$

      $= \mathcal{E}e^{it'Z},$

where $Z = OY$ also has the density $g(y'y)$. The equality (28) for all orthogonal
$O$ implies $\mathcal{E}e^{it'Z}$ is a function of $t't$. We write

(29)  $\mathcal{E}e^{it'Y} = \phi(t't).$

Then for $X = \mu + CY$

(30)  $\mathcal{E}e^{it'X} = e^{it'\mu}\,\mathcal{E}e^{it'CY}$

      $= e^{it'\mu}\,\phi(t'CC't)$

      $= e^{it'\mu}\,\phi(t'\Lambda t)$

when $\Lambda = CC'$. Conversely, any characteristic function of the form
$e^{it'\mu}\phi(t'\Lambda t)$ corresponding to a density corresponds to a random vector $X$
with the density (2).

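The moments of $X$ with an elliptically contoured distribution can be
found from the characteristic function $e^{it'\mu}\phi(t'\Lambda t)$ or from the representation
$X = \mu + RCU$, where $C'\Lambda^{-1}C = I$. Note that

(31)  $\mathcal{E}R^2 = C(p)\int_0^{\infty} r^{p+1}g(r^2)\,dr = -2p\,\phi'(0),$

(32)  $\mathcal{E}R^4 = C(p)\int_0^{\infty} r^{p+3}g(r^2)\,dr = 4p(p+2)\,\phi''(0).$

Consider the higher-order moments of $Y = RU$. The odd-order moments
of $U$ are 0, and hence the odd-order moments of $Y$ are 0.

We have

(33)  $\mathcal{E}(X_i - \mu_i)(X_j - \mu_j)(X_k - \mu_k) = 0.$

In fact, all moments of $X - \mu$ of odd order are 0.

Consider $\mathcal{E}U_iU_jU_kU_l$. Because $U'U = 1$,

(34)  $1 = \sum_{i,j=1}^{p}\mathcal{E}U_i^2U_j^2 = p\,\mathcal{E}U_1^4 + p(p-1)\,\mathcal{E}U_1^2U_2^2.$

Integration of $\mathcal{E}\sin^4\Theta_1$ gives $\mathcal{E}U_1^4 = 3/[p(p+2)]$; then (34) implies
$\mathcal{E}U_1^2U_2^2 = 1/[p(p+2)]$. Hence $\mathcal{E}Y_i^4 = 3\,\mathcal{E}R^4/[p(p+2)]$ and
$\mathcal{E}Y_i^2Y_j^2 = \mathcal{E}R^4/[p(p+2)]$, $i \ne j$. Unless $i = j = k = l$ or $i = j \ne k = l$ or
$i = k \ne j = l$ or $i = l \ne j = k$, we have $\mathcal{E}U_iU_jU_kU_l = 0$. To summarize,
$\mathcal{E}U_iU_jU_kU_l = (\delta_{ij}\delta_{kl} + \delta_{ik}\delta_{jl} + \delta_{il}\delta_{jk})/[p(p+2)]$. The fourth-order moments of $X$ are

(35)  $\mathcal{E}(X_i - \mu_i)(X_j - \mu_j)(X_k - \mu_k)(X_l - \mu_l)$

      $= \dfrac{\mathcal{E}R^4}{p(p+2)}\,(\lambda_{ij}\lambda_{kl} + \lambda_{ik}\lambda_{jl} + \lambda_{il}\lambda_{jk})$

      $= \dfrac{\mathcal{E}R^4}{(\mathcal{E}R^2)^2}\,\dfrac{p}{p+2}\,(\sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}).$

The fourth cumulant of the $i$th component of $X$ standardized by its
standard deviation is

(36)  $\dfrac{\mathcal{E}(X_i - \mu_i)^4 - 3[\mathcal{E}(X_i - \mu_i)^2]^2}{[\mathcal{E}(X_i - \mu_i)^2]^2} = \dfrac{\dfrac{3\,\mathcal{E}R^4}{p(p+2)} - 3\left(\dfrac{\mathcal{E}R^2}{p}\right)^2}{\left(\dfrac{\mathcal{E}R^2}{p}\right)^2}$

      $= 3\left\{\dfrac{p}{p+2}\,\dfrac{\mathcal{E}R^4}{(\mathcal{E}R^2)^2} - 1\right\} = 3\left\{\dfrac{\phi''(0)}{[\phi'(0)]^2} - 1\right\} = 3\kappa,$

say. This is known as the kurtosis. (Note that $\kappa$ is
$\frac{1}{3}\mathcal{E}\{(X_i - \mu_i)^4/[\mathcal{E}(X_i - \mu_i)^2]^2\} - 1$.) The standardized fourth cumulant is $3\kappa$ for every
component of $X$. The fourth cumulant of $X_i$, $X_j$, $X_k$, and $X_l$ is

(37)  $\kappa_{ijkl} = \mathcal{E}(X_i - \mu_i)(X_j - \mu_j)(X_k - \mu_k)(X_l - \mu_l) - (\sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk})$

      $= \kappa\,(\sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}).$

For the normal distribution $\kappa = 0$. The fourth-order moments can be written

(38)  $\mathcal{E}(X_i - \mu_i)(X_j - \mu_j)(X_k - \mu_k)(X_l - \mu_l) = (1 + \kappa)(\sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}).$

More detail about elliptically contoured distributions can be found in Fang
and Zhang (1990).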
The class of elliptically contoured distributions generalizes the normal
distribution, introducing more flexibility; the kurtosis is not required to be 0.
The typical "bell-shaped surface" of $|\Lambda|^{-\frac{1}{2}}g[(x - \nu)'\Lambda^{-1}(x - \nu)]$ can be
more or less peaked than in the case of the normal distribution. In the next
subsection some examples are given.


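2.7.5. Examples

(1) The multivariate t-distribution. Suppose $Z \sim N(0, I_p)$, $ms^2 \stackrel{d}{=} \chi_m^2$, and $Z$
and $s^2$ are independent. Define $Y = (1/s)Z$. Then the density of $Y$ is

(39)  $\dfrac{\Gamma\!\left(\frac{m+p}{2}\right)}{\Gamma\!\left(\frac{m}{2}\right)m^{p/2}\pi^{p/2}}\left(1 + \dfrac{y'y}{m}\right)^{-\frac{m+p}{2}}$

and

(40)  $\dfrac{R^2}{p} = \dfrac{\|Y\|^2}{p} \stackrel{d}{=} \dfrac{\chi_p^2/p}{\chi_m^2/m} \sim F_{p,m}.$

If $X = \mu + CY$, the density of $X$ is

(41)  $\dfrac{\Gamma\!\left(\frac{m+p}{2}\right)}{\Gamma\!\left(\frac{m}{2}\right)m^{p/2}\pi^{p/2}}\,|\Lambda|^{-\frac{1}{2}}\left[1 + \dfrac{(x - \mu)'\Lambda^{-1}(x - \mu)}{m}\right]^{-\frac{1}{2}(m+p)}.$
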


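(2) Contaminated normal. The contaminated normal distribution is a mixture
of two normal distributions with proportional covariance matrices and
the same mean vector. The density can be written

(42)  $(1 - \varepsilon)\,\dfrac{1}{(2\pi)^{p/2}|\Lambda|^{\frac{1}{2}}}\,e^{-\frac{1}{2}(x - \mu)'\Lambda^{-1}(x - \mu)} + \varepsilon\,\dfrac{1}{(2\pi)^{p/2}|c\Lambda|^{\frac{1}{2}}}\,e^{-(1/2c)(x - \mu)'\Lambda^{-1}(x - \mu)},$

where $c > 0$ and $0 \le \varepsilon \le 1$. Usually $\varepsilon$ is rather small and $c$ rather large.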

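(3) Mixtures of normal distributions. Let $w(v)$ be a cumulative distribution
function over $0 \le v < \infty$. Then a mixture of normal densities is defined by

(43)  $\int_0^{\infty}\dfrac{1}{(2\pi)^{p/2}|\Lambda|^{\frac{1}{2}}v^{p}}\,e^{-(1/2v^2)(x - \mu)'\Lambda^{-1}(x - \mu)}\,dw(v),$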
which is an elliptically contoured density. The random vector $X$ with this
density has a representation $X \stackrel{d}{=} \mu + wZ$, where $Z \sim N(0, \Lambda)$ and $w \sim w(\cdot)$ are
independent.

Fang, Kotz, and Ng (1990) have discussed (43) and have given other 
examples of elliptically contoured distributions. 
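
For the multivariate t-distribution of example (1), the kurtosis parameter $\kappa$ of Section 2.7.4 can be computed in closed form. The Python sketch below is an illustration under stated assumptions: it uses $R^2 = Y'Y = Z'Z/s^2$ together with the standard moments $\mathcal{E}\chi_p^2 = p$, $\mathcal{E}(\chi_p^2)^2 = p(p+2)$, $\mathcal{E}(m/\chi_m^2) = m/(m-2)$, and $\mathcal{E}(m/\chi_m^2)^2 = m^2/[(m-2)(m-4)]$, which are not derived in the text; the function names are hypothetical. The result is compared with $2/(m-4)$, the value to which the formula simplifies.

```python
def t_radial_moments(p, m):
    """E R^2 and E R^4 for Y = Z / s with Z ~ N(0, I_p), m s^2 ~ chi^2_m.

    Uses standard chi-square moments (an assumption stated in the lead-in);
    the fourth moment exists only for m > 4.
    """
    if m <= 4:
        raise ValueError("the fourth moment of R exists only for m > 4")
    e_r2 = p * m / (m - 2.0)
    e_r4 = p * (p + 2.0) * m * m / ((m - 2.0) * (m - 4.0))
    return e_r2, e_r4


def kurtosis_parameter(p, e_r2, e_r4):
    """kappa = p E R^4 / [(p + 2) (E R^2)^2] - 1."""
    return p * e_r4 / ((p + 2.0) * e_r2 ** 2) - 1.0


p, m = 3, 10
e_r2, e_r4 = t_radial_moments(p, m)
kappa = kurtosis_parameter(p, e_r2, e_r4)
print(kappa, 2.0 / (m - 4))
```

The two printed numbers agree, and $\kappa$ does not depend on $p$. As $m$ grows, $\kappa$ tends to 0, the value for the normal distribution.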


PROBLEMS 

2.1. (Sec. 2.2) Let $f(x,y) = 1$, $0 \le x \le 1$, $0 \le y \le 1$,

     $= 0$, otherwise.

Find:

(a) $F(x,y)$.

(b) $F(x)$.

(c) $f(x)$.

(d) $f(x|y)$. [Note: $f(x_0|y_0) = 0$ if $f(x_0, y_0) = 0$.]

(e) $\mathcal{E}X^nY^m$.

(f) Prove $X$ and $Y$ are independent.

2.2. (Sec. 2.2) Let $f(x,y) = 2$, $0 \le y \le x \le 1$,

     $= 0$, otherwise.

Find:

(a) $F(x,y)$.

(b) $F(x)$.

(c) $f(x)$.

(d) $G(y)$.

(e) $g(y)$.

(f) $f(x|y)$.

(g) $f(y|x)$.

(h) $\mathcal{E}X^nY^m$.

(i) Are $X$ and $Y$ independent?

2.3. (Sec. 2.2) Let $f(x,y) = C$ for $x^2 + y^2 \le k^2$ and 0 elsewhere. Prove
$C = 1/(\pi k^2)$, $\mathcal{E}X = \mathcal{E}Y = 0$, $\mathcal{E}X^2 = \mathcal{E}Y^2 = k^2/4$, and $\mathcal{E}XY = 0$. Are $X$ and $Y$
independent?

2.4. (Sec. 2.2) Let $F(x_1, x_2)$ be the joint cdf of $X_1$, $X_2$, and let $F_i(x_i)$ be the
marginal cdf of $X_i$, $i = 1, 2$. Prove that if $F_i(x_i)$ is continuous, $i = 1, 2$, then
$F(x_1, x_2)$ is continuous.

2.5. (Sec. 2.2) Show that if the set $X_1,\dots,X_r$ is independent of the set
$X_{r+1},\dots,X_p$, then

     $\mathcal{E}g(X_1,\dots,X_r)\,h(X_{r+1},\dots,X_p) = \mathcal{E}g(X_1,\dots,X_r)\;\mathcal{E}h(X_{r+1},\dots,X_p).$

2.6. (Sec. 2.3) Sketch the ellipses $f(x,y) = 0.06$, where $f(x,y)$ is the bivariate
normal density with

(a) $\mu_x = 1$, $\mu_y = 2$, $\sigma_x^2 = 1$, $\sigma_y^2 = 1$, $\rho_{xy} = 0$.

(b) $\mu_x = 0$, $\mu_y = 0$, $\sigma_x^2 = 1$, $\sigma_y^2 = 1$, $\rho_{xy} = 0$.

(c) $\mu_x = 0$, $\mu_y = 0$, $\sigma_x^2 = 1$, $\sigma_y^2 = 1$, $\rho_{xy} = 0.2$.

(d) $\mu_x = 0$, $\mu_y = 0$, $\sigma_x^2 = 1$, $\sigma_y^2 = 1$, $\rho_{xy} = 0.8$.

(e) $\mu_x = 0$, $\mu_y = 0$, $\sigma_x^2 = 4$, $\sigma_y^2 = 1$, $\rho_{xy} = 0.8$.

2.7. (Sec. 2.3) Find $b$ and $A$ so that the following densities can be written in the
form of (23). Also find $\mu_x$, $\mu_y$, $\sigma_x$, $\sigma_y$, and $\rho_{xy}$.

(a) $\dfrac{1}{2\pi}\exp\{-\frac{1}{2}[(x - 1)^2 + (y - 2)^2]\}$.

(b) $\dfrac{1}{2.4\pi}\exp\left(-\dfrac{x^2/4 - 1.6xy/2 + y^2}{0.72}\right)$.

(c) $\dfrac{1}{2\pi}\exp[-\frac{1}{2}(x^2 + y^2 + 4x - 6y + 13)]$.

(d) $\dfrac{1}{2\pi}\exp[-\frac{1}{2}(2x^2 + y^2 + 2xy - 22x - 14y + 65)]$.


2.8. (Sec, 2.3) For each matrix A in Problem 2.7 find C so that C f AC = L 

2.9. (Sec. 13) Let * = 0. 


A = 


7 3 

3 4 

U 1 


2 \ 

1 

2} 


(a) Write the density (23). 

(b) Find 2. 

j 

2.10. (Sec. 2.3) Prove that the principal axes of (55) of Section 2.3 are along the 45° and 135° lines with lengths $2\sqrt{c(1+\rho)}$ and $2\sqrt{c(1-\rho)}$, respectively, by transforming according to $y_1 = (z_1 + z_2)/\sqrt{2}$, $y_2 = (z_1 - z_2)/\sqrt{2}$.

2.11. (Sec. 2.3) Suppose the scalar random variables $X_1,\dots,X_n$ are independent and have a density which is a function only of $x_1^2 + \cdots + x_n^2$. Prove that the $X_i$ are normally distributed with mean 0 and common variance. Indicate the mildest conditions on the density for your proof.

2.12. (Sec. 2.3) Show that if $\Pr\{X \ge 0, Y \ge 0\} = \alpha$ for the distribution

$N\left[\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right],$

then $\rho = \cos(1 - 2\alpha)\pi$. [Hint: Let $X = U$, $Y = \rho U + \sqrt{1-\rho^2}\,V$ and verify $\rho = \cos 2\pi(\tfrac12 - \alpha)$ geometrically.]

2.13. (Sec. 2.3) Prove that if $\rho_{ij} = \rho$, $i \ne j$, $i,j = 1,\dots,p$, then $\rho \ge -1/(p-1)$.

2.14. (Sec. 2.3) Concentration ellipsoid. Let the density of the $p$-component $Y$ be $f(y) = \Gamma(\tfrac12 p + 1)/[(p+2)\pi]^{\frac12 p}$ for $y'y \le p + 2$ and 0 elsewhere. Then $\mathscr{E}Y = 0$ and $\mathscr{E}YY' = I$ (Problem 7.4). From this result prove that if the density of $X$ is $g(x) = \sqrt{|A|}\,\Gamma(\tfrac12 p + 1)/[(p+2)\pi]^{\frac12 p}$ for $(x-\mu)'A(x-\mu) \le p + 2$ and 0 elsewhere, then $\mathscr{E}X = \mu$ and $\mathscr{E}(X-\mu)(X-\mu)' = A^{-1}$.

2.15. (Sec. 2.4) Show that when $X$ is normally distributed the components are mutually independent if and only if the covariance matrix is diagonal.

2.16. (Sec. 2.4) Find necessary and sufficient conditions on $A$ so that $AY + \lambda$ has a continuous cdf.

2.17. (Sec. 2.4) Which densities in Problem 2.7 define distributions in which $X$ and $Y$ are independent?

2.18. (Sec. 2.4)

(a) Write the marginal density of $X$ for each case in Problem 2.6.
(b) Indicate the marginal distribution of $X$ for each case in Problem 2.7 by the notation $N(a, b)$.
(c) Write the marginal density of $X_1$ and $X_2$ in Problem 2.9.

2.19. (Sec. 2.4) What is the distribution of $Z = X - Y$ when $X$ and $Y$ have each of the densities in Problem 2.6?

2.20. (Sec. 2.4) What is the distribution of $X_1 + 2X_2 - 3X_3$ when $X_1, X_2, X_3$ have the distribution defined in Problem 2.9?

2.21. (Sec. 2.4) Let $X = (X_1, X_2)'$, where $X_1 = X$ and $X_2 = aX + b$ and $X$ has the distribution $N(0,1)$. Find the cdf of $X$.

2.22. (Sec. 2.4) Let $X_1,\dots,X_N$ be independently distributed, each according to $N(\mu, \sigma^2)$.

(a) What is the distribution of $X = (X_1,\dots,X_N)'$? Find the vector of means and the covariance matrix.
(b) Using Theorem 2.4.4, find the marginal distribution of $\bar{X} = \sum X_i/N$.

2.23. (Sec. 2.4) Let $X_1,\dots,X_N$ be independently distributed with $X_i$ having distribution $N(\beta + \gamma z_i, \sigma^2)$, where $z_i$ is a given number, $i = 1,\dots,N$, and $\sum_i z_i = 0$.

(a) Find the distribution of $(X_1,\dots,X_N)'$.
(b) Find the distribution of $\bar{X}$ and $g = \sum_i X_i z_i/\sum_i z_i^2$ for $\sum_i z_i^2 > 0$.

2.24. (Sec. 2.4) Let $(X_1,Y_1)'$, $(X_2,Y_2)'$, $(X_3,Y_3)'$ be independently distributed, $(X_i,Y_i)'$ according to

$N\left[\begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \begin{pmatrix} \sigma_{xx} & \sigma_{xy} \\ \sigma_{yx} & \sigma_{yy} \end{pmatrix}\right], \qquad i = 1,2,3.$

(a) Find the distribution of the six variables.
(b) Find the distribution of $(\bar{X}, \bar{Y})'$.

2.25. (Sec. 2.4) Let $X$ have a (singular) normal distribution with mean 0 and
covariance matrix


X 




(a) Prove $\Sigma$ is of rank 1.
(b) Find $a$ so $X = a'Y$ and $Y$ has a nonsingular normal distribution, and give the density of $Y$.

2.26. (Sec. 2.4) Let

$\Sigma = \begin{pmatrix} 2 & -1 & 3 \\ -1 & 5 & -3 \\ 3 & -3 & 5 \end{pmatrix}.$

(a) Find a vector $u \ne 0$ so that $\Sigma u = 0$. [Hint: Take cofactors of any column.]
(b) Show that any matrix of the form $G = (H\ u)$, where $H$ is $3 \times 2$, has the property

$G'\Sigma G = \begin{pmatrix} H'\Sigma H & 0 \\ 0 & 0 \end{pmatrix}.$

(c) Using (a) and (b), find $B$ to satisfy (36).
(d) Find $B^{-1}$ and partition according to (39).
(e) Verify that $CC' = \Sigma$.

2.27. (Sec. 2.4) Prove that if the joint (marginal) distribution of $X_1$ and $X_2$ is singular (that is, degenerate), then the joint distribution of $X_1$, $X_2$, and $X_3$ is singular.

2.28. (Sec. 2.5) In each part of Problem 2.6, find the conditional distribution of $X$ given $Y = y$, find the conditional distribution of $Y$ given $X = x$, and plot each regression line on the appropriate graph in Problem 2.6.

2.29. (Sec. 2.5) Let $\mu = 0$ and

$\Sigma = \begin{pmatrix} 1. & 0.80 & -0.40 \\ 0.80 & 1. & -0.56 \\ -0.40 & -0.56 & 1. \end{pmatrix}.$

(a) Find the conditional distribution of $X_1$ and $X_3$, given $X_2 = x_2$.
(b) What is the partial correlation between $X_1$ and $X_3$ given $X_2$?

2.30. (Sec. 2.5) In Problem 2.9, find the conditional distribution of $X_1$ and $X_2$ given $X_3 = x_3$.

2.31. (Sec. 2.5) Verify (20) directly from Theorem 2.5.1.

2.32. (Sec. 2.5)

(a) Show that finding $\alpha$ to maximize the absolute value of the correlation between $X_i$ and $\alpha'X^{(2)}$ is equivalent to maximizing $(\sigma_{(i)}'\alpha)^2$ subject to $\alpha'\Sigma_{22}\alpha$ constant.
(b) Find $\alpha$ by maximizing $(\sigma_{(i)}'\alpha)^2 - \lambda(\alpha'\Sigma_{22}\alpha - c)$, where $c$ is a constant and $\lambda$ is a Lagrange multiplier.

2.33. (Sec. 2.5) Invariance of the multiple correlation coefficient. Prove that $R_{i\cdot q+1,\dots,p}$ is an invariant characteristic of the multivariate normal distribution of $X_i$ and $X^{(2)}$ under the transformation $x_i^* = b_i x_i + c_i$ for $b_i \ne 0$ and $x^{(2)*} = Hx^{(2)} + k$ for $H$ nonsingular and that every function of $\mu_i$, $\sigma_{ii}$, $\sigma_{(i)}$, $\mu^{(2)}$, and $\Sigma_{22}$ that is invariant is a function of $R_{i\cdot q+1,\dots,p}$.

2.34. (Sec. 2.5) Prove that 






Pki 



k y j^q ^ 1 ， •••，/>• 


2.35. (Sec. 2.5) Find the multiple correlation coefficient between $X_1$ and $(X_2, X_3)$ in Problem 2.29.

2.36. (Sec. 2.5) Prove explicitly that if $\Sigma$ is positive definite,

$|\Sigma| = |\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}| \cdot |\Sigma_{22}|.$

2.37. (Sec. 2.5) Prove Hadamard's inequality

$|\Sigma| \le \prod_{i=1}^{p} \sigma_{ii}.$

[Hint: Using Problem 2.36, prove $|\Sigma| \le \sigma_{11}|\Sigma_{22}|$, where $\Sigma_{22}$ is $(p-1) \times (p-1)$, and apply induction.]

2.38. (Sec. 2.5) Prove equality holds in Problem 2.37 if and only if $\Sigma$ is diagonal.

2.39. (Sec. 2.5) Prove $\beta_{12\cdot 3} = \sigma_{12\cdot 3}/\sigma_{22\cdot 3} = \rho_{12\cdot 3}\,\sigma_{1\cdot 3}/\sigma_{2\cdot 3}$ and $\beta_{13\cdot 2} = \sigma_{13\cdot 2}/\sigma_{33\cdot 2} = \rho_{13\cdot 2}\,\sigma_{1\cdot 2}/\sigma_{3\cdot 2}$, where $\sigma_{i\cdot k}^2 = \sigma_{ii\cdot k}$.

2.40. (Sec. 2.5) Let $(X_1, X_2)$ have the density $n(x \mid 0, \Sigma) = f(x_1, x_2)$. Let the density of $X_2$ given $X_1$ be $f(x_2|x_1)$. Let the joint density of $X_1, X_2, X_3$ be $f(x_1,x_2)f(x_3|x_1)$. Find the covariance matrix of $X_1, X_2, X_3$ and the partial correlation between $X_2$ and $X_3$ for given $X_1$.

2.41. (Sec. 2.5) Prove $1 - R_{1\cdot 23}^2 = (1 - \rho_{12}^2)(1 - \rho_{13\cdot 2}^2)$. [Hint: Use the fact that the variance of $X_1$ in the conditional distribution given $x_2$ and $x_3$ is $\sigma_{11}(1 - R_{1\cdot 23}^2)$.]

2.42. (Sec. 2.5) If $p = 2$, can there be a difference between the simple correlation between $X_1$ and $X_2$ and the multiple correlation between $X_1$ and $X^{(2)} = X_2$? Explain.

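2.43. (Sec. 2.5) Prove

$\beta_{ik\cdot q+1,\dots,k-1,k+1,\dots,p} = \dfrac{\sigma_{ik\cdot q+1,\dots,k-1,k+1,\dots,p}}{\sigma_{kk\cdot q+1,\dots,k-1,k+1,\dots,p}} = \rho_{ik\cdot q+1,\dots,k-1,k+1,\dots,p}\,\dfrac{\sigma_{i\cdot q+1,\dots,k-1,k+1,\dots,p}}{\sigma_{k\cdot q+1,\dots,k-1,k+1,\dots,p}},$

$i = 1,\dots,q$, $k = q+1,\dots,p$, where $\sigma^2_{j\cdot q+1,\dots,k-1,k+1,\dots,p} = \sigma_{jj\cdot q+1,\dots,k-1,k+1,\dots,p}$, $j = i, k$. [Hint: Prove this for the special case $k = q + 1$ by using Problem 2.56 with $p_1 = q$, $p_2 = 1$, $p_3 = p - q - 1$.]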

2.44. (Sec. 2.5) Give a necessary and sufficient condition for $R_{i\cdot q+1,\dots,p} = 0$ in terms of $\sigma_{i,q+1},\dots,\sigma_{ip}$.

2.45. (Sec. 2.5) Show

$1 - R^2_{i\cdot q+1,\dots,p} = (1 - \rho^2_{ip})(1 - \rho^2_{i,p-1\cdot p}) \cdots (1 - \rho^2_{i,q+1\cdot q+2,\dots,p}).$

[Hint: Use (19) and (27) successively.]

2.46. (Sec. 2.5) Show

2,47. (Sec. 2.5) Prove 



[Hint: Apply Theorem A.3.2 of the Appendix to the cofactors used to calculate 

Z] 


2.48. (Sec. 2.5) Show that for any joint distribution for which the expectations exist and any function $h(x^{(2)})$ that

$\mathscr{E}[X_i - \mathscr{E}(X_i \mid X^{(2)})]\,h(X^{(2)}) = 0.$

[Hint: In the above take the expectation first with respect to $X_i$ conditional on $X^{(2)}$.]

2.49. (Sec. 2.5) Show that for any function $h(x^{(2)})$ and any joint distribution of $X_i$ and $X^{(2)}$ for which the relevant expectations exist, $\mathscr{E}[X_i - h(X^{(2)})]^2 = \mathscr{E}[X_i - g(X^{(2)})]^2 + \mathscr{E}[g(X^{(2)}) - h(X^{(2)})]^2$, where $g(x^{(2)}) = \mathscr{E}(X_i \mid x^{(2)})$ is the conditional expectation of $X_i$ given $X^{(2)} = x^{(2)}$. Hence $g(X^{(2)})$ minimizes the mean squared error of prediction. [Hint: Use Problem 2.48.]

2.50. (Sec. 2.5) Show that for any function $h(x^{(2)})$ and any joint distribution of $X_i$ and $X^{(2)}$ for which the relevant expectations exist, the correlation between $X_i$ and $h(X^{(2)})$ is not greater than the correlation between $X_i$ and $g(X^{(2)})$, where $g(x^{(2)}) = \mathscr{E}(X_i \mid x^{(2)})$.

2.51. (Sec. 2.5) Show that for any vector function $h(x^{(2)})$

$\mathscr{E}[X^{(1)} - h(X^{(2)})][X^{(1)} - h(X^{(2)})]' - \mathscr{E}[X^{(1)} - \mathscr{E}(X^{(1)} \mid X^{(2)})][X^{(1)} - \mathscr{E}(X^{(1)} \mid X^{(2)})]'$

is positive semidefinite. Note this generalizes Theorem 2.5.3 and Problem 2.49.

2.52. (Sec. 2.5) Verify that $\Sigma_{12}\Sigma_{22}^{-1} = -\Psi_{11}^{-1}\Psi_{12}$, where $\Psi = \Sigma^{-1}$ is partitioned similarly to $\Sigma$.


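2.53. (Sec. 2.5) Show

$\Sigma^{-1} = \begin{pmatrix} \Sigma_{11\cdot 2}^{-1} & -\Sigma_{11\cdot 2}^{-1}\Sigma_{12}\Sigma_{22}^{-1} \\ -\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11\cdot 2}^{-1} & \Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11\cdot 2}^{-1}\Sigma_{12}\Sigma_{22}^{-1} + \Sigma_{22}^{-1} \end{pmatrix} = \begin{pmatrix} 0 & 0 \\ 0 & \Sigma_{22}^{-1} \end{pmatrix} + \begin{pmatrix} I \\ -\beta' \end{pmatrix} \Sigma_{11\cdot 2}^{-1}\,(I \;\; -\beta),$

where $\beta = \Sigma_{12}\Sigma_{22}^{-1}$. [Hint: Use Theorem A.3.3 of the Appendix and the fact that $\Sigma^{-1}$ is symmetric.]

2.54. (Sec. 2.5) Use Problem 2.53 to show that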


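2.55. (Sec. 2.5) Show

$\mathscr{E}(X^{(1)} \mid x^{(2)}, x^{(3)}) = \mu^{(1)} + \Sigma_{13}\Sigma_{33}^{-1}(x^{(3)} - \mu^{(3)}) + (\Sigma_{12} - \Sigma_{13}\Sigma_{33}^{-1}\Sigma_{32})(\Sigma_{22} - \Sigma_{23}\Sigma_{33}^{-1}\Sigma_{32})^{-1} \cdot [x^{(2)} - \mu^{(2)} - \Sigma_{23}\Sigma_{33}^{-1}(x^{(3)} - \mu^{(3)})].$

2.56. (Sec. 2.5) Prove by matrix algebra that

$\Sigma_{11} - (\Sigma_{12}\;\;\Sigma_{13})\begin{pmatrix} \Sigma_{22} & \Sigma_{23} \\ \Sigma_{32} & \Sigma_{33} \end{pmatrix}^{-1}\begin{pmatrix} \Sigma_{21} \\ \Sigma_{31} \end{pmatrix} = \Sigma_{11} - \Sigma_{13}\Sigma_{33}^{-1}\Sigma_{31} - (\Sigma_{12} - \Sigma_{13}\Sigma_{33}^{-1}\Sigma_{32})(\Sigma_{22} - \Sigma_{23}\Sigma_{33}^{-1}\Sigma_{32})^{-1}(\Sigma_{21} - \Sigma_{23}\Sigma_{33}^{-1}\Sigma_{31}).$

2.57. (Sec. 2.5) Invariance of the partial correlation coefficient. Prove that $\rho_{12\cdot 3,\dots,p}$ is invariant under the transformations $x_i^* = a_i x_i + b_i'x^{(3)} + c_i$, $a_i > 0$, $i = 1,2$, $x^{(3)*} = Cx^{(3)} + d$, where $x^{(3)} = (x_3,\dots,x_p)'$, and that any function of $\mu$ and $\Sigma$ that is invariant under these transformations is a function of $\rho_{12\cdot 3,\dots,p}$.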


2.58. (Sec. 2.5) Suppose $X^{(1)}$ and $X^{(2)}$ of $q$ and $p - q$ components, respectively, have the density

$\dfrac{|A|^{\frac12}}{(2\pi)^{\frac12 p}}\,e^{-\frac12 Q},$

where

$Q = (x^{(1)} - \mu^{(1)})'A_{11}(x^{(1)} - \mu^{(1)}) + (x^{(1)} - \mu^{(1)})'A_{12}(x^{(2)} - \mu^{(2)}) + (x^{(2)} - \mu^{(2)})'A_{21}(x^{(1)} - \mu^{(1)}) + (x^{(2)} - \mu^{(2)})'A_{22}(x^{(2)} - \mu^{(2)}).$

Show that $Q$ can be written as $Q_1 + Q_2$, where

$Q_1 = [(x^{(1)} - \mu^{(1)}) + A_{11}^{-1}A_{12}(x^{(2)} - \mu^{(2)})]'A_{11}[(x^{(1)} - \mu^{(1)}) + A_{11}^{-1}A_{12}(x^{(2)} - \mu^{(2)})],$

$Q_2 = (x^{(2)} - \mu^{(2)})'(A_{22} - A_{21}A_{11}^{-1}A_{12})(x^{(2)} - \mu^{(2)}).$

Show that the marginal density of $X^{(2)}$ is

$\dfrac{|A_{22} - A_{21}A_{11}^{-1}A_{12}|^{\frac12}}{(2\pi)^{\frac12(p-q)}}\,e^{-\frac12 Q_2}.$

Show that the conditional density of $X^{(1)}$ given $X^{(2)} = x^{(2)}$ is

$\dfrac{|A_{11}|^{\frac12}}{(2\pi)^{\frac12 q}}\,e^{-\frac12 Q_1}$

(without using the Appendix). This problem is meant to furnish an alternative proof of Theorems 2.4.3 and 2.5.1.

2.59. (Sec. 2.6) Prove Lemma 2.6.2 in detail.

2.60. (Sec. 2.6) Let $Y$ be distributed according to $N(0, \Sigma)$. Differentiating the characteristic function, verify (25) and (26).

2.61. (Sec. 2.6) Verify (25) and (26) by using the transformation $X - \mu = CY$, where $\Sigma = CC'$, and integrating the density of $Y$.

2.62. (Sec. 2.6) Let the density of $(X, Y)$ be

$2n(x \mid 0,1)\,n(y \mid 0,1)$ for $0 \le y \le x < \infty$, $\;0 \le -x \le y < \infty$, $\;0 \le -y \le -x < \infty$, $\;0 \le x \le -y < \infty$,

$0$ otherwise.

Show that $X$, $Y$, $X + Y$, $X - Y$ each have a marginal normal distribution.

2.63. (Sec. 2.6) Suppose $X$ is distributed according to $N(0, \Sigma)$. Let $\Sigma = (\sigma_1,\dots,\sigma_p)$.
Prove


玄（ XX ， ^Ja f ) = 2 公 I + vec 2 (vec 2) f + : 

=(/ + K)(2 02) + vec2 (vec2 )，， 

where 






* 

vec 2 = 

_ a P 

， K = 


* 


u P a， p 


and $\varepsilon_i$ is a column vector with 1 in the $i$th position and 0's elsewhere.


2.64. Complex normal distribution. Let $(X', Y')'$ have a normal distribution with mean vector $(\mu_X', \mu_Y')'$ and covariance matrix

$\Sigma = \begin{pmatrix} \Gamma & -\Phi \\ \Phi & \Gamma \end{pmatrix},$

where $\Gamma$ is positive definite and $\Phi = -\Phi'$ (skew symmetric). Then $Z = X + iY$ is said to have a complex normal distribution with mean $\theta = \mu_X + i\mu_Y$ and covariance matrix $\mathscr{E}(Z - \theta)(Z - \theta)^* = P = Q + iR$, where $Z^* = X' - iY'$. Note that $P$ is Hermitian and positive definite.

(a) Show $Q = 2\Gamma$ and $R = 2\Phi$.
(b) Show $|P|^2 = |2\Sigma|$. [Hint: $|\Gamma + i\Phi| = |\Gamma - i\Phi|$.]
(c) Show

$P^{-1} = (Q + RQ^{-1}R)^{-1} - iQ^{-1}R(Q + RQ^{-1}R)^{-1}.$

Note that the inverse of a Hermitian matrix is Hermitian.

(d) Show that the density of $X$ and $Y$ can be written

$\dfrac{1}{\pi^p |P|}\,e^{-(z-\theta)^* P^{-1}(z-\theta)}.$

2.65. Complex normal (continued). If $Z$ has the complex normal distribution of Problem 2.64, show that $W = AZ$, where $A$ is a nonsingular complex matrix, has the complex normal distribution with mean $A\theta$ and covariance matrix $APA^*$.

2.66. Show that the characteristic function of $Z$ defined in Problem 2.64 is

$\mathscr{E}e^{i\mathscr{R}(u^*Z)} = e^{i\mathscr{R}u^*\theta - \frac14 u^*Pu},$

where $\mathscr{R}(x + iy) = x$.

2.67. (Sec. 2.2) Show that $\int_{-a}^{a} e^{-\frac12 x^2}\,dx/\sqrt{2\pi}$ is approximately $(1 - e^{-2a^2/\pi})^{1/2}$. [Hint: The probability that $(X, Y)$ falls in a square is approximately the probability that $(X, Y)$ falls in an approximating circle [Pólya (1949)].]

2.68. (Sec. 2.7) For the multivariate $t$-distribution with density (41) show that $\mathscr{E}X = \mu$ and $\mathscr{C}(X) = [m/(m-2)]A$.



CHAPTER 3 


Estimation of the Mean Vector 
and the Covariance Matrix 


3.1. INTRODUCTION

The multivariate normal distribution is specified completely by the mean vector $\mu$ and the covariance matrix $\Sigma$. The first statistical problem is how to estimate these parameters on the basis of a sample of observations. In Section 3.2 it is shown that the maximum likelihood estimator of $\mu$ is the sample mean; the maximum likelihood estimator of $\Sigma$ is proportional to the matrix of sample variances and covariances. A sample variance is a sum of squares of deviations of observations from the sample mean divided by one less than the number of observations in the sample; a sample covariance is similarly defined in terms of cross products. The sample covariance matrix is an unbiased estimator of $\Sigma$.

The distribution of the sample mean vector is given in Section 3.3, and it is shown how one can test the hypothesis that $\mu$ is a given vector when $\Sigma$ is known. The case of $\Sigma$ unknown will be treated in Chapter 5.

Some theoretical properties of the sample mean are given in Section 3.4, and the Bayes estimator of the population mean is derived for a normal a priori distribution. In Section 3.5 the James-Stein estimator is introduced; improvements over the sample mean for the mean squared error loss function are discussed.

In Section 3.6 estimators of the mean vector and covariance matrix of elliptically contoured distributions and the distributions of the estimators are treated.


An Introduction to Multivariate Statistical Analysis, Third Edition. By T. W. Anderson
ISBN 0-471-36091-0 Copyright © 2003 John Wiley & Sons, Inc.

3.2. THE MAXIMUM LIKELIHOOD ESTIMATORS OF THE MEAN
VECTOR AND THE COVARIANCE MATRIX

Given a sample of (vector) observations from a $p$-variate (nondegenerate) normal distribution, we ask for estimators of the mean vector $\mu$ and the covariance matrix $\Sigma$ of the distribution. We shall deduce the maximum likelihood estimators.

It turns out that the method of maximum likelihood is very useful in various estimation and hypothesis testing problems concerning the multivariate normal distribution. The maximum likelihood estimators or modifications of them often have some optimum properties. In the particular case studied here, the estimators are asymptotically efficient [Cramér (1946), Sec. 33.3].

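Suppose our sample of $N$ observations on $X$ distributed according to $N(\mu, \Sigma)$ is $x_1,\dots,x_N$, where $N > p$. The likelihood function is

(1) $L = \prod_{\alpha=1}^{N} n(x_\alpha \mid \mu, \Sigma) = \dfrac{1}{(2\pi)^{\frac12 pN}|\Sigma|^{\frac12 N}} \exp\left[-\tfrac12 \sum_{\alpha=1}^{N} (x_\alpha - \mu)'\Sigma^{-1}(x_\alpha - \mu)\right].$

In the likelihood function the vectors $x_1,\dots,x_N$ are fixed at the sample values and $L$ is a function of $\mu$ and $\Sigma$. To emphasize that these quantities are variables (and not parameters) we shall denote them by $\mu^*$ and $\Sigma^*$. Then the logarithm of the likelihood function is

(2) $\log L = -\tfrac12 pN \log 2\pi - \tfrac12 N \log|\Sigma^*| - \tfrac12 \sum_{\alpha=1}^{N} (x_\alpha - \mu^*)'\Sigma^{*-1}(x_\alpha - \mu^*).$

Since $\log L$ is an increasing function of $L$, its maximum is at the same point in the space of $\mu^*$, $\Sigma^*$ as the maximum of $L$. The maximum likelihood estimators of $\mu$ and $\Sigma$ are the vector $\mu^*$ and the positive definite matrix $\Sigma^*$ that maximize $\log L$. (It remains to be seen that the supremum of $\log L$ is attained for a positive definite matrix $\Sigma^*$.)

Let the sample mean vector be

(3) $\bar{x} = \dfrac{1}{N}\sum_{\alpha=1}^{N} x_\alpha = \begin{pmatrix} \frac{1}{N}\sum_{\alpha=1}^{N} x_{1\alpha} \\ \vdots \\ \frac{1}{N}\sum_{\alpha=1}^{N} x_{p\alpha} \end{pmatrix} = \begin{pmatrix} \bar{x}_1 \\ \vdots \\ \bar{x}_p \end{pmatrix},$

where $x_\alpha = (x_{1\alpha},\dots,x_{p\alpha})'$ and $\bar{x}_i = \sum_{\alpha=1}^{N} x_{i\alpha}/N$, and let the matrix of sums of squares and cross products of deviations about the mean be

(4) $A = \sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})' = \left[\sum_{\alpha=1}^{N} (x_{i\alpha} - \bar{x}_i)(x_{j\alpha} - \bar{x}_j)\right], \qquad i,j = 1,\dots,p.$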

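It will be convenient to use the following lemma:

Lemma 3.2.1. Let $x_1,\dots,x_N$ be $N$ ($p$-component) vectors, and let $\bar{x}$ be defined by (3). Then for any vector $b$

(5) $\sum_{\alpha=1}^{N} (x_\alpha - b)(x_\alpha - b)' = \sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})' + N(\bar{x} - b)(\bar{x} - b)'.$

Proof.

(6) $\sum_{\alpha=1}^{N} (x_\alpha - b)(x_\alpha - b)' = \sum_{\alpha=1}^{N} [(x_\alpha - \bar{x}) + (\bar{x} - b)][(x_\alpha - \bar{x}) + (\bar{x} - b)]'$

$= \sum_{\alpha=1}^{N} [(x_\alpha - \bar{x})(x_\alpha - \bar{x})' + (x_\alpha - \bar{x})(\bar{x} - b)' + (\bar{x} - b)(x_\alpha - \bar{x})' + (\bar{x} - b)(\bar{x} - b)']$

$= \sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})' + \left[\sum_{\alpha=1}^{N} (x_\alpha - \bar{x})\right](\bar{x} - b)' + (\bar{x} - b)\sum_{\alpha=1}^{N} (x_\alpha - \bar{x})' + N(\bar{x} - b)(\bar{x} - b)'.$

The second and third terms on the right-hand side are 0 because $\sum(x_\alpha - \bar{x}) = \sum x_\alpha - N\bar{x} = 0$ by (3). ■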
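The identity (5) can be checked numerically on a small data set. The following is a minimal sketch in plain Python; the four observations and the vector `b` are arbitrary illustrative values, and the helper names (`outer`, `scatter`) are not from the text.

```python
def outer(u, v):
    """Outer product u v' of two lists, returned as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def mat_add(a, b):
    """Elementwise sum of two matrices stored as lists of rows."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scatter(xs, center):
    """Sum over observations of (x - center)(x - center)'."""
    p = len(center)
    total = [[0.0] * p for _ in range(p)]
    for x in xs:
        d = [xi - ci for xi, ci in zip(x, center)]
        total = mat_add(total, outer(d, d))
    return total

# Illustrative data: N = 4 observations with p = 2 components.
xs = [[1.0, 2.0], [3.0, 0.0], [2.0, 5.0], [6.0, 1.0]]
N = len(xs)
xbar = [sum(x[i] for x in xs) / N for i in range(2)]
b = [0.5, -1.0]  # any vector b

lhs = scatter(xs, b)  # left-hand side of (5)
shift = [xi - bi for xi, bi in zip(xbar, b)]
rhs = mat_add(scatter(xs, xbar),
              [[N * v for v in row] for row in outer(shift, shift)])
max_gap = max(abs(l - r) for rl, rr in zip(lhs, rhs) for l, r in zip(rl, rr))
print(max_gap)  # agreement up to rounding error
```

Taking `b` equal to the zero vector gives the computing formula used later in (14).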

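When we let $b = \mu^*$, we have

(7) $\sum_{\alpha=1}^{N} (x_\alpha - \mu^*)(x_\alpha - \mu^*)' = \sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})' + N(\bar{x} - \mu^*)(\bar{x} - \mu^*)' = A + N(\bar{x} - \mu^*)(\bar{x} - \mu^*)'.$

Using this result and the properties of the trace of a matrix ($\operatorname{tr} CD = \sum c_{ij}d_{ji} = \operatorname{tr} DC$), we have

(8) $\sum_{\alpha=1}^{N} (x_\alpha - \mu^*)'\Sigma^{*-1}(x_\alpha - \mu^*) = \operatorname{tr} \sum_{\alpha=1}^{N} (x_\alpha - \mu^*)'\Sigma^{*-1}(x_\alpha - \mu^*)$

$= \operatorname{tr} \sum_{\alpha=1}^{N} \Sigma^{*-1}(x_\alpha - \mu^*)(x_\alpha - \mu^*)'$

$= \operatorname{tr} \Sigma^{*-1}A + \operatorname{tr} \Sigma^{*-1}N(\bar{x} - \mu^*)(\bar{x} - \mu^*)'$

$= \operatorname{tr} \Sigma^{*-1}A + N(\bar{x} - \mu^*)'\Sigma^{*-1}(\bar{x} - \mu^*).$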

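Thus we can write (2) as

(9) $\log L = -\tfrac12 pN \log(2\pi) - \tfrac12 N \log|\Sigma^*| - \tfrac12 \operatorname{tr} \Sigma^{*-1}A - \tfrac12 N(\bar{x} - \mu^*)'\Sigma^{*-1}(\bar{x} - \mu^*).$

Since $\Sigma^*$ is positive definite, $\Sigma^{*-1}$ is positive definite, and $N(\bar{x} - \mu^*)'\Sigma^{*-1}(\bar{x} - \mu^*) \ge 0$ and is 0 if and only if $\mu^* = \bar{x}$. To maximize the second and third terms of (9) we use the following lemma (which is also used in later chapters):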

Lemma 3.2.2. If $D$ is positive definite of order $p$, the maximum of

(10) $f(G) = -N \log|G| - \operatorname{tr} G^{-1}D$

with respect to positive definite matrices $G$ exists, occurs at $G = (1/N)D$, and has the value

(11) $f[(1/N)D] = pN \log N - N \log|D| - pN.$

Proof. Let $D = EE'$ and $E'G^{-1}E = H$. Then $G = EH^{-1}E'$, and $|G| = |E| \cdot |H^{-1}| \cdot |E'| = |H^{-1}| \cdot |EE'| = |D|/|H|$, and $\operatorname{tr} G^{-1}D = \operatorname{tr} G^{-1}EE' = \operatorname{tr} E'G^{-1}E = \operatorname{tr} H$. Then the function to be maximized (with respect to positive definite $H$) is

(12) $f = -N \log|D| + N \log|H| - \operatorname{tr} H.$

Let $H = TT'$, where $T$ is lower triangular (Corollary A.1.7). Then the maximum of

(13) $f = -N \log|D| + N \log|T|^2 - \operatorname{tr} TT' = -N \log|D| + \sum_{i=1}^{p} (N \log t_{ii}^2 - t_{ii}^2) - \sum_{i>j} t_{ij}^2$

occurs at $t_{ii}^2 = N$, $t_{ij} = 0$, $i \ne j$; that is, at $H = NI$. Then $G = (1/N)EE' = (1/N)D$. ■
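As an informal check of Lemma 3.2.2, the sketch below evaluates $f(G)$ of (10) for $p = 2$ at $G = (1/N)D$, compares the value with the closed form (11), and confirms that a few nearby positive definite matrices give smaller values. The matrix `D`, the value of `N`, and the perturbations are illustrative assumptions; the check is numerical evidence, not a proof.

```python
import math

def det2(m):
    """Determinant of a 2 x 2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def f(G, D, N):
    """f(G) = -N log|G| - tr(G^{-1} D) for 2 x 2 positive definite G, as in (10)."""
    det = det2(G)
    inv = [[G[1][1] / det, -G[0][1] / det],
           [-G[1][0] / det, G[0][0] / det]]
    trace = sum(inv[i][k] * D[k][i] for i in range(2) for k in range(2))
    return -N * math.log(det) - trace

N = 10
D = [[36.0, 25.0], [25.0, 29.0]]              # positive definite, illustrative
G_hat = [[d / N for d in row] for row in D]   # claimed maximizer (1/N) D

f_max = f(G_hat, D, N)
closed_form = 2 * N * math.log(N) - N * math.log(det2(D)) - 2 * N  # (11), p = 2

# Symmetric perturbations of G_hat that remain positive definite.
perturbed = [
    [[G_hat[0][0] + a, G_hat[0][1] + b], [G_hat[1][0] + b, G_hat[1][1] + c]]
    for a, b, c in [(0.5, 0.0, 0.0), (0.0, 0.3, 0.0),
                    (0.0, 0.0, -0.4), (-0.3, 0.2, 0.1)]
]
smaller = all(f(G, D, N) < f_max for G in perturbed)
print(f_max, closed_form, smaller)
```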

Theorem 3.2.1. If $x_1,\dots,x_N$ constitute a sample from $N(\mu, \Sigma)$ with $p < N$, the maximum likelihood estimators of $\mu$ and $\Sigma$ are $\hat{\mu} = \bar{x} = (1/N)\sum_{\alpha=1}^{N} x_\alpha$ and $\hat{\Sigma} = (1/N)\sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})'$, respectively.

Other methods of deriving the maximum likelihood estimators have been discussed by Anderson and Olkin (1985). See Problems 3.4, 3.8, and 3.12.

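Computation of the estimate $\hat{\Sigma}$ is made easier by the specialization of Lemma 3.2.1 ($b = 0$)

(14) $\sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})' = \sum_{\alpha=1}^{N} x_\alpha x_\alpha' - N\bar{x}\bar{x}'.$

An element of $\sum_{\alpha=1}^{N} x_\alpha x_\alpha'$ is computed as $\sum_{\alpha=1}^{N} x_{i\alpha}x_{j\alpha}$, and an element of $N\bar{x}\bar{x}'$ is computed as $N\bar{x}_i\bar{x}_j$ or $(\sum_{\alpha=1}^{N} x_{i\alpha})(\sum_{\alpha=1}^{N} x_{j\alpha})/N$. It should be noted that if $N > p$, the probability is 1 of drawing a sample so that (14) is positive definite; see Problem 3.17.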
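The computation described above can be sketched in a few lines of Python. The function name `mle_mean_cov` and the sample values are illustrative assumptions; the function returns $\bar{x}$, the matrix $A$ computed through (14), and $\hat{\Sigma} = A/N$.

```python
def mle_mean_cov(xs):
    """Return (xbar, A, sigma_hat) for a list of p-component observations.

    A is computed as in (14): raw cross products minus
    (sum of x_i)(sum of x_j)/N; sigma_hat = A / N.
    """
    N = len(xs)
    p = len(xs[0])
    sums = [sum(x[i] for x in xs) for i in range(p)]
    xbar = [s / N for s in sums]
    cross = [[sum(x[i] * x[j] for x in xs) for j in range(p)] for i in range(p)]
    A = [[cross[i][j] - sums[i] * sums[j] / N for j in range(p)]
         for i in range(p)]
    sigma_hat = [[a / N for a in row] for row in A]
    return xbar, A, sigma_hat

# Illustrative sample: N = 4, p = 2.
xbar, A, sigma_hat = mle_mean_cov([[1.0, 2.0], [3.0, 0.0], [2.0, 5.0], [6.0, 1.0]])
print(xbar, A, sigma_hat)
```

In floating-point arithmetic the subtraction in (14) can lose accuracy when the means are large relative to the spread of the data; summing squared deviations from $\bar{x}$ directly is then the safer route.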

The covariance matrix can be written in terms of the variances or standard 
deviations and correlation coefficients. These are uniquely defined by the 
variances and covariances. We assert that the maximum likelihood estimators 
of functions of the parameters are those functions of the maximum likelihood 
estimators of the parameters. 

Lemma 3.2.3. Let $f(\theta)$ be a real-valued function defined on a set $S$, and let $\phi$ be a single-valued function, with a single-valued inverse, on $S$ to a set $S^*$; that is, to each $\theta \in S$ there corresponds a unique $\theta^* \in S^*$, and, conversely, to each $\theta^* \in S^*$ there corresponds a unique $\theta \in S$. Let

(15) $g(\theta^*) = f[\phi^{-1}(\theta^*)].$

Then if $f(\theta)$ attains a maximum at $\theta = \theta_0$, $g(\theta^*)$ attains a maximum at $\theta^* = \theta_0^* = \phi(\theta_0)$. If the maximum of $f(\theta)$ at $\theta_0$ is unique, so is the maximum of $g(\theta^*)$ at $\theta_0^*$.

Proof. By hypothesis $f(\theta_0) \ge f(\theta)$ for all $\theta \in S$. Then for any $\theta^* \in S^*$

(16) $g(\theta^*) = f[\phi^{-1}(\theta^*)] = f(\theta) \le f(\theta_0) = g[\phi(\theta_0)] = g(\theta_0^*).$

Thus $g(\theta^*)$ attains a maximum at $\theta_0^*$. If the maximum of $f(\theta)$ at $\theta_0$ is unique, there is strict inequality above for $\theta \ne \theta_0$, and the maximum of $g(\theta^*)$ is unique. ■

We have the following corollary:

Corollary 3.2.1. If on the basis of a given sample $\hat\theta_1,\dots,\hat\theta_m$ are maximum likelihood estimators of the parameters $\theta_1,\dots,\theta_m$ of a distribution, then $\phi_1(\hat\theta_1,\dots,\hat\theta_m),\dots,\phi_m(\hat\theta_1,\dots,\hat\theta_m)$ are maximum likelihood estimators of $\phi_1(\theta_1,\dots,\theta_m),\dots,\phi_m(\theta_1,\dots,\theta_m)$ if the transformation from $\theta_1,\dots,\theta_m$ to $\phi_1,\dots,\phi_m$ is one-to-one.† If the estimators of $\theta_1,\dots,\theta_m$ are unique, then the estimators of $\phi_1,\dots,\phi_m$ are unique.

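Corollary 3.2.2. If $x_1,\dots,x_N$ constitutes a sample from $N(\mu, \Sigma)$, where $\sigma_{ij} = \sigma_i\sigma_j\rho_{ij}$ ($\rho_{ii} = 1$), then the maximum likelihood estimator of $\mu$ is $\hat\mu = \bar{x} = (1/N)\sum_\alpha x_\alpha$; the maximum likelihood estimator of $\sigma_i^2$ is $\hat\sigma_i^2 = (1/N)\sum_\alpha (x_{i\alpha} - \bar{x}_i)^2 = (1/N)(\sum_\alpha x_{i\alpha}^2 - N\bar{x}_i^2)$, where $x_{i\alpha}$ is the $i$th component of $x_\alpha$ and $\bar{x}_i$ is the $i$th component of $\bar{x}$; and the maximum likelihood estimator of $\rho_{ij}$ is

(17) $\hat\rho_{ij} = \dfrac{\sum_\alpha (x_{i\alpha} - \bar{x}_i)(x_{j\alpha} - \bar{x}_j)}{\sqrt{\sum_\alpha (x_{i\alpha} - \bar{x}_i)^2}\,\sqrt{\sum_\alpha (x_{j\alpha} - \bar{x}_j)^2}} = \dfrac{\sum_\alpha x_{i\alpha}x_{j\alpha} - N\bar{x}_i\bar{x}_j}{\sqrt{\sum_\alpha x_{i\alpha}^2 - N\bar{x}_i^2}\,\sqrt{\sum_\alpha x_{j\alpha}^2 - N\bar{x}_j^2}}.$

Proof. The set of parameters $\mu_i = \mu_i$, $\sigma_i^2 = \sigma_{ii}$, and $\rho_{ij} = \sigma_{ij}/\sqrt{\sigma_{ii}\sigma_{jj}}$ is a one-to-one transform of the set of parameters $\mu_i$ and $\sigma_{ij}$. Therefore, by Corollary 3.2.1 the estimator of $\mu_i$ is $\hat\mu_i$, of $\sigma_i^2$ is $\hat\sigma_{ii}$, and of $\rho_{ij}$ is

(18) $\hat\rho_{ij} = \dfrac{\hat\sigma_{ij}}{\sqrt{\hat\sigma_{ii}\hat\sigma_{jj}}}.$ ■

Pearson (1896) gave a justification for this estimator of $\rho_{ij}$, and (17) is sometimes called the Pearson correlation coefficient. It is also called the simple correlation coefficient. It is usually denoted by $r_{ij}$.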

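†The assumption that the transformation is one-to-one is made so that the set $\phi_1,\dots,\phi_m$ uniquely defines the likelihood. An alternative in case $\theta^* = \phi(\theta)$ does not have a unique inverse is to define $s(\theta^*) = \{\theta : \phi(\theta) = \theta^*\}$ and $g(\theta^*) = \sup\{f(\theta) \mid \theta \in s(\theta^*)\}$, which is considered the "induced likelihood" when $f(\theta)$ is the likelihood function. Then $\hat\theta^* = \phi(\hat\theta)$ maximizes $g(\theta^*)$, for $g(\theta^*) = \sup\{f(\theta) \mid \theta \in s(\theta^*)\} \le \sup\{f(\theta) \mid \theta \in S\} = f(\hat\theta) = g(\hat\theta^*)$ for all $\theta^* \in S^*$. [See, e.g., Zehna (1966).]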






A convenient geometrical interpretation of this sample $(x_1, x_2, \dots, x_N) = X$ is in terms of the rows of $X$. Let

(19) $X = \begin{pmatrix} x_{11} & \cdots & x_{1N} \\ \vdots & & \vdots \\ x_{p1} & \cdots & x_{pN} \end{pmatrix} = \begin{pmatrix} u_1' \\ \vdots \\ u_p' \end{pmatrix};$

that is, $u_i'$ is the $i$th row of $X$. The vector $u_i$ can be considered as a vector in an $N$-dimensional space with the $\alpha$th coordinate of one endpoint being $x_{i\alpha}$ and the other endpoint at the origin. Thus the sample is represented by $p$ vectors in $N$-dimensional Euclidean space. By definition of the Euclidean metric, the squared length of $u_i$ (that is, the squared distance of one endpoint from the other) is $u_i'u_i = \sum_{\alpha=1}^{N} x_{i\alpha}^2$.

Now let us show that the cosine of the angle between $u_i$ and $u_j$ is $u_i'u_j/\sqrt{u_i'u_i\,u_j'u_j} = \sum_{\alpha=1}^{N} x_{i\alpha}x_{j\alpha}/\sqrt{\sum_{\alpha=1}^{N} x_{i\alpha}^2 \sum_{\alpha=1}^{N} x_{j\alpha}^2}$. Choose the scalar $d$ so the vector $du_j$ is orthogonal to $u_i - du_j$; that is, $0 = du_j'(u_i - du_j) = d(u_j'u_i - du_j'u_j)$. Therefore, $d = u_j'u_i/u_j'u_j$. We decompose $u_i$ into $u_i - du_j$ and $du_j$ [$u_i = (u_i - du_j) + du_j$] as indicated in Figure 3.1. The absolute value of the cosine of the angle between $u_i$ and $u_j$ is the length of $du_j$ divided by the length of $u_i$; that is, it is $\sqrt{du_j'(du_j)/u_i'u_i} = \sqrt{du_j'u_jd/u_i'u_i}$; the cosine is $u_i'u_j/\sqrt{u_i'u_i\,u_j'u_j}$. This proves the desired result.
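The decomposition used in this argument is easy to reproduce numerically. The sketch below takes two illustrative vectors (arbitrary values, not from the text), computes $d$, checks that $u_i - du_j$ is orthogonal to $u_j$, and compares the ratio of lengths with the usual cosine formula.

```python
import math

def dot(u, v):
    """Inner product u'v."""
    return sum(a * b for a, b in zip(u, v))

# Two illustrative vectors in N = 3 dimensional space.
u_i = [3.0, 1.0, 2.0]
u_j = [1.0, 2.0, 2.0]

d = dot(u_j, u_i) / dot(u_j, u_j)              # d = u_j'u_i / u_j'u_j
residual = [a - d * b for a, b in zip(u_i, u_j)]
orth = dot(residual, u_j)                      # should vanish

cosine = dot(u_i, u_j) / math.sqrt(dot(u_i, u_i) * dot(u_j, u_j))
# Length of d*u_j divided by length of u_i gives |cosine|.
abs_cos_from_lengths = math.sqrt(d * d * dot(u_j, u_j) / dot(u_i, u_i))
print(d, orth, cosine, abs_cos_from_lengths)
```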

To give a geometric interpretation of $a_{ii}$ and $a_{ij}/\sqrt{a_{ii}a_{jj}}$, we introduce the equiangular line, which is the line going through the origin and the point $(1,1,\dots,1)$. See Figure 3.2. The projection of $u_i$ on the vector $\varepsilon = (1,1,\dots,1)'$ is $(\varepsilon'u_i/\varepsilon'\varepsilon)\varepsilon = (\sum_\alpha x_{i\alpha}/\sum_\alpha 1)\varepsilon = \bar{x}_i\varepsilon = (\bar{x}_i, \bar{x}_i, \dots, \bar{x}_i)'$. Then we decompose $u_i$ into $\bar{x}_i\varepsilon$, the projection on the equiangular line, and $u_i - \bar{x}_i\varepsilon$, the projection of $u_i$ on the plane perpendicular to the equiangular line. The squared length of $u_i - \bar{x}_i\varepsilon$ is $(u_i - \bar{x}_i\varepsilon)'(u_i - \bar{x}_i\varepsilon) = \sum_\alpha (x_{i\alpha} - \bar{x}_i)^2$; this is $N\hat\sigma_{ii} = a_{ii}$. Translate $u_i - \bar{x}_i\varepsilon$ and $u_j - \bar{x}_j\varepsilon$, so that each vector has an endpoint at the origin; the $\alpha$th coordinate of the first vector is $x_{i\alpha} - \bar{x}_i$, and of the second is $x_{j\alpha} - \bar{x}_j$. The cosine of the angle between these two vectors is

(20) $r_{ij} = \dfrac{(u_i - \bar{x}_i\varepsilon)'(u_j - \bar{x}_j\varepsilon)}{\sqrt{(u_i - \bar{x}_i\varepsilon)'(u_i - \bar{x}_i\varepsilon)\,(u_j - \bar{x}_j\varepsilon)'(u_j - \bar{x}_j\varepsilon)}} = \dfrac{\sum_{\alpha=1}^{N} (x_{i\alpha} - \bar{x}_i)(x_{j\alpha} - \bar{x}_j)}{\sqrt{\sum_{\alpha=1}^{N} (x_{i\alpha} - \bar{x}_i)^2 \sum_{\alpha=1}^{N} (x_{j\alpha} - \bar{x}_j)^2}}.$

As an example of the calculations consider the data in Table 3.1 and graphed in Figure 3.3, taken from Student (1908). The measurement $x_{11} = 1.9$ on the first patient is the increase in the number of hours of sleep due to the use of the sedative A, $x_{21} = 0.7$ is the increase in the number of hours due to sedative B, and so on. Assuming that each pair (i.e., each row in the table) is an observation from $N(\mu, \Sigma)$, we find that

(21) $\hat\mu = \bar{x} = \begin{pmatrix} 2.33 \\ 0.75 \end{pmatrix}, \qquad \hat\Sigma = \begin{pmatrix} 3.61 & 2.56 \\ 2.56 & 2.88 \end{pmatrix}, \qquad S = \begin{pmatrix} 4.01 & 2.85 \\ 2.85 & 3.20 \end{pmatrix},$

and $\hat\rho_{12} = r_{12} = 0.7952$. ($S$ will be defined later.)

Table 3.1. Increase in Sleep

Patient    Drug A ($x_1$)    Drug B ($x_2$)
1           1.9               0.7
2           0.8              -1.6
3           1.1              -0.2
4           0.1              -1.2
5          -0.1              -0.1
6           4.4               3.4
7           5.5               3.7
8           1.6               0.8
9           4.6               0.0
10          3.4               2.0

Figure 3.3. Increase in sleep.
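The figures in (21) can be reproduced directly from the ten pairs of Table 3.1. The following minimal Python sketch uses the values as transcribed in the table; the variable names are illustrative. It computes $\bar{x}$, $\hat\Sigma = A/N$, $S = A/(N-1)$, and $r_{12}$.

```python
import math

# Increase in sleep (hours) for the ten patients of Table 3.1.
drug_a = [1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.4]
drug_b = [0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0]
N = len(drug_a)

mean_a = sum(drug_a) / N
mean_b = sum(drug_b) / N

# Sums of squares and cross products of deviations (elements of A).
a11 = sum((x - mean_a) ** 2 for x in drug_a)
a22 = sum((y - mean_b) ** 2 for y in drug_b)
a12 = sum((x - mean_a) * (y - mean_b) for x, y in zip(drug_a, drug_b))

sigma_hat = [[a11 / N, a12 / N], [a12 / N, a22 / N]]   # maximum likelihood estimate
S = [[a11 / (N - 1), a12 / (N - 1)],
     [a12 / (N - 1), a22 / (N - 1)]]                   # divisor N - 1
r12 = a12 / math.sqrt(a11 * a22)                       # correlation, as in (20)

print(round(mean_a, 2), round(mean_b, 2))
print([[round(v, 2) for v in row] for row in sigma_hat])
print([[round(v, 2) for v in row] for row in S])
print(round(r12, 4))
```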


3.3. THE DISTRIBUTION OF THE SAMPLE MEAN VECTOR; 
INFERENCE CONCERNING THE MEAN WHEN THE COVARIANCE 
MATRIX IS KNOWN 

3.3.1. Distribution Theory

In the univariate case the mean of a sample is distributed normally and independently of the sample variance. Similarly, the sample mean $\bar{X}$ defined in Section 3.2 is distributed normally and independently of $\hat\Sigma$.

To prove this result we shall make a transformation of the set of observation vectors. Because this kind of transformation is used several times in this book, we first prove a more general theorem.

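Theorem 3.3.1. Suppose $X_1,\dots,X_N$ are independent, where $X_\alpha$ is distributed according to $N(\mu_\alpha, \Sigma)$. Let $C = (c_{\alpha\beta})$ be an $N \times N$ orthogonal matrix. Then $Y_\alpha = \sum_{\beta=1}^{N} c_{\alpha\beta}X_\beta$ is distributed according to $N(\nu_\alpha, \Sigma)$, where $\nu_\alpha = \sum_{\beta=1}^{N} c_{\alpha\beta}\mu_\beta$, $\alpha = 1,\dots,N$, and $Y_1,\dots,Y_N$ are independent.

Proof. The set of vectors $Y_1,\dots,Y_N$ have a joint normal distribution, because the entire set of components is a set of linear combinations of the components of $X_1,\dots,X_N$, which have a joint normal distribution. The expected value of $Y_\alpha$ is

(1) $\mathscr{E}Y_\alpha = \mathscr{E}\sum_{\beta=1}^{N} c_{\alpha\beta}X_\beta = \sum_{\beta=1}^{N} c_{\alpha\beta}\mathscr{E}X_\beta = \sum_{\beta=1}^{N} c_{\alpha\beta}\mu_\beta = \nu_\alpha.$

The covariance matrix between $Y_\alpha$ and $Y_\gamma$ is

(2) $\mathscr{C}(Y_\alpha, Y_\gamma') = \mathscr{E}(Y_\alpha - \nu_\alpha)(Y_\gamma - \nu_\gamma)'$

$= \mathscr{E}\left[\sum_{\beta=1}^{N} c_{\alpha\beta}(X_\beta - \mu_\beta)\right]\left[\sum_{\varepsilon=1}^{N} c_{\gamma\varepsilon}(X_\varepsilon - \mu_\varepsilon)'\right]$

$= \sum_{\beta,\varepsilon=1}^{N} c_{\alpha\beta}c_{\gamma\varepsilon}\mathscr{E}(X_\beta - \mu_\beta)(X_\varepsilon - \mu_\varepsilon)'$

$= \sum_{\beta,\varepsilon=1}^{N} c_{\alpha\beta}c_{\gamma\varepsilon}\delta_{\beta\varepsilon}\Sigma$

$= \sum_{\beta=1}^{N} c_{\alpha\beta}c_{\gamma\beta}\Sigma$

$= \delta_{\alpha\gamma}\Sigma,$

where $\delta_{\alpha\gamma}$ is the Kronecker delta ($= 1$ if $\alpha = \gamma$ and $= 0$ if $\alpha \ne \gamma$). This shows that $Y_\alpha$ is independent of $Y_\gamma$, $\alpha \ne \gamma$, and $Y_\alpha$ has the covariance matrix $\Sigma$. ■

We also use the following general lemma:

Lemma 3.3.1. If $C = (c_{\alpha\beta})$ is orthogonal, then $\sum_{\alpha=1}^{N} x_\alpha x_\alpha' = \sum_{\alpha=1}^{N} y_\alpha y_\alpha'$, where $y_\alpha = \sum_{\beta=1}^{N} c_{\alpha\beta}x_\beta$, $\alpha = 1,\dots,N$.

Proof.

(3) $\sum_{\alpha=1}^{N} y_\alpha y_\alpha' = \sum_\alpha \sum_\beta c_{\alpha\beta}x_\beta \sum_\gamma c_{\alpha\gamma}x_\gamma' = \sum_{\beta,\gamma}\left(\sum_\alpha c_{\alpha\beta}c_{\alpha\gamma}\right)x_\beta x_\gamma' = \sum_{\beta,\gamma} \delta_{\beta\gamma}x_\beta x_\gamma' = \sum_\beta x_\beta x_\beta'.$ ■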
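Lemma 3.3.1 can be illustrated with a specific orthogonal matrix. In the sketch below the $3 \times 3$ matrix `C` and the three observation vectors are illustrative choices (not from the text); the sum of outer products is the same before and after the transformation, up to rounding.

```python
# An orthogonal 3 x 3 matrix: the rows are orthonormal (illustrative choice).
C = [[2 / 3, -2 / 3, 1 / 3],
     [1 / 3, 2 / 3, 2 / 3],
     [2 / 3, 1 / 3, -2 / 3]]

xs = [[1.0, 2.0], [3.0, 0.0], [2.0, 5.0]]   # N = 3 vectors with p = 2 components

def sum_outer(vectors):
    """Sum of v v' over the given vectors, as a p x p list of rows."""
    p = len(vectors[0])
    return [[sum(v[i] * v[j] for v in vectors) for j in range(p)]
            for i in range(p)]

# y_alpha = sum over beta of c_{alpha beta} x_beta
ys = [[sum(C[a][b] * xs[b][i] for b in range(3)) for i in range(2)]
      for a in range(3)]

gap = max(abs(s - t)
          for row_x, row_y in zip(sum_outer(xs), sum_outer(ys))
          for s, t in zip(row_x, row_y))
print(gap)  # zero up to rounding error
```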


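Let $X_1,\dots,X_N$ be independent, each distributed according to $N(\mu, \Sigma)$. There exists an $N \times N$ orthogonal matrix $B = (b_{\alpha\beta})$ with the last row

(4) $(1/\sqrt{N}, \dots, 1/\sqrt{N}).$

(See Lemma A.4.2.) This transformation is a rotation in the $N$-dimensional space described in Section 3.2 with the equiangular line going into the $N$th coordinate axis. Let $A = N\hat\Sigma$, defined in Section 3.2, and let

(5) $Z_\alpha = \sum_{\beta=1}^{N} b_{\alpha\beta}X_\beta.$

Then

(6) $Z_N = \sum_{\beta=1}^{N} b_{N\beta}X_\beta = \sum_{\beta=1}^{N} \dfrac{1}{\sqrt{N}}X_\beta = \sqrt{N}\,\bar{X}.$

By Lemma 3.3.1 we have

(7) $A = \sum_{\alpha=1}^{N} X_\alpha X_\alpha' - N\bar{X}\bar{X}' = \sum_{\alpha=1}^{N} Z_\alpha Z_\alpha' - Z_N Z_N' = \sum_{\alpha=1}^{N-1} Z_\alpha Z_\alpha'.$

Since $Z_N$ is independent of $Z_1,\dots,Z_{N-1}$, the mean vector $\bar{X}$ is independent of $A$. Since

(8) $\mathscr{E}Z_N = \sum_{\beta=1}^{N} b_{N\beta}\mathscr{E}X_\beta = \sum_{\beta=1}^{N} \dfrac{1}{\sqrt{N}}\mu = \sqrt{N}\mu,$

$Z_N$ is distributed according to $N(\sqrt{N}\mu, \Sigma)$ and $\bar{X} = (1/\sqrt{N})Z_N$ is distributed according to $N[\mu, (1/N)\Sigma]$. We note

(9) $\mathscr{E}Z_\alpha = \sum_{\beta=1}^{N} b_{\alpha\beta}\mathscr{E}X_\beta = \sum_{\beta=1}^{N} b_{\alpha\beta}\mu = \sum_{\beta=1}^{N} b_{\alpha\beta}b_{N\beta}\sqrt{N}\mu = 0, \qquad \alpha \ne N.$

Theorem 3.3.2. The mean of a sample of size $N$ from $N(\mu, \Sigma)$ is distributed according to $N[\mu, (1/N)\Sigma]$ and independently of $\hat\Sigma$, the maximum likelihood estimator of $\Sigma$. $N\hat\Sigma$ is distributed as $\sum_{\alpha=1}^{N-1} Z_\alpha Z_\alpha'$, where $Z_\alpha$ is distributed according to $N(0, \Sigma)$, $\alpha = 1,\dots,N-1$, and $Z_1,\dots,Z_{N-1}$ are independent.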
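The algebra in (6) and (7) can be traced on a small sample. The text only asserts that an orthogonal $B$ with last row (4) exists; the sketch below assumes one concrete choice, a Helmert-type matrix, which is an illustrative construction and not necessarily the one in Lemma A.4.2. The data values and the function name `helmert` are likewise assumptions.

```python
import math

def helmert(N):
    """An N x N orthogonal matrix whose last row is (1/sqrt(N), ..., 1/sqrt(N)).

    Row k (k = 1, ..., N-1) has 1/sqrt(k(k+1)) in its first k entries,
    -k/sqrt(k(k+1)) in entry k+1, and zeros afterwards.
    """
    B = [[0.0] * N for _ in range(N)]
    for k in range(1, N):
        scale = 1.0 / math.sqrt(k * (k + 1))
        for j in range(k):
            B[k - 1][j] = scale
        B[k - 1][k] = -k * scale
    B[N - 1] = [1.0 / math.sqrt(N)] * N
    return B

xs = [[1.0, 2.0], [3.0, 0.0], [2.0, 5.0], [6.0, 1.0]]   # illustrative sample
N, p = len(xs), len(xs[0])
B = helmert(N)

# z_alpha = sum over beta of b_{alpha beta} x_beta, as in (5)
zs = [[sum(B[a][b] * xs[b][i] for b in range(N)) for i in range(p)]
      for a in range(N)]

xbar = [sum(x[i] for x in xs) / N for i in range(p)]
A_direct = [[sum((x[i] - xbar[i]) * (x[j] - xbar[j]) for x in xs)
             for j in range(p)] for i in range(p)]
# Right-hand side of (7): only the first N - 1 transformed vectors enter.
A_from_z = [[sum(z[i] * z[j] for z in zs[:N - 1]) for j in range(p)]
            for i in range(p)]
print(zs[N - 1], A_direct, A_from_z)
```

The last transformed vector equals $\sqrt{N}\,\bar{x}$, and the remaining $N - 1$ vectors reproduce $A$, which is the decomposition behind Theorem 3.3.2.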

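Definition 3.3.1. An estimator $t$ of a parameter vector $\theta$ is unbiased if and only if $\mathscr{E}_\theta t = \theta$.

Since $\mathscr{E}\bar{X} = (1/N)\mathscr{E}\sum_{\alpha=1}^{N} X_\alpha = \mu$, the sample mean is an unbiased estimator of the population mean. However,

(10) $\mathscr{E}\hat\Sigma = \dfrac{1}{N}\mathscr{E}\sum_{\alpha=1}^{N-1} Z_\alpha Z_\alpha' = \dfrac{N-1}{N}\Sigma.$

Thus $\hat\Sigma$ is a biased estimator of $\Sigma$. We shall therefore define

(11) $S = \dfrac{1}{N-1}A = \dfrac{1}{N-1}\sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})'$

as the sample covariance matrix. It is an unbiased estimator of $\Sigma$ and the diagonal elements are the usual (unbiased) sample variances of the components of $X$.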
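The bias factor $(N-1)/N$ in (10) does not depend on normality; it holds for independent draws from any distribution with finite variance. The sketch below therefore uses a small discrete population (an assumption made purely so that the expectation can be computed exactly by enumeration) and exact rational arithmetic, in the scalar case $p = 1$.

```python
from fractions import Fraction
from itertools import product

# Hypothetical scalar population: three equally likely values.
values = [Fraction(0), Fraction(1), Fraction(5)]
N = 3  # sample size

mu = sum(values) / len(values)
sigma2 = sum((v - mu) ** 2 for v in values) / len(values)   # population variance

# All equally likely samples of N independent draws.
samples = list(product(values, repeat=N))

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample mean (the scalar A)."""
    xbar = sum(sample) / len(sample)
    return sum((x - xbar) ** 2 for x in sample)

mean_mle = sum(sum_sq_dev(s) / N for s in samples) / len(samples)       # E of A/N
mean_S = sum(sum_sq_dev(s) / (N - 1) for s in samples) / len(samples)   # E of A/(N-1)
print(sigma2, mean_mle, mean_S)
```

The expected value of $A/N$ comes out as $(N-1)/N$ times the population variance, while that of $A/(N-1)$ equals the population variance.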


3.3.2. Tests and Confidence Regions for the Mean Vector When the 
Covariance Matrix Is Known 

A statistical problem of considerable importance is that of testing the hypothesis that the mean vector of a normal distribution is a given vector, and a related problem is that of giving a confidence region for the unknown vector of means. We now go on to study these problems under the assumption that the covariance matrix $\Sigma$ is known. In Chapter 5 we consider these problems when the covariance matrix is unknown.

In the univariate case one bases a test or a confidence interval on the fact that the difference between the sample mean and the population mean is normally distributed with mean zero and known variance; then tables of the normal distribution can be used to set up significance points or to compute confidence intervals. In the multivariate case one uses the fact that the difference between the sample mean vector and the population mean vector is normally distributed with mean vector zero and known covariance matrix. One could set up limits for each component on the basis of the distribution, but this procedure has the disadvantages that the choice of limits is somewhat arbitrary and in the case of tests leads to tests that may be very poor against some alternatives, and, moreover, such limits are difficult to compute because tables are available only for the bivariate case. The procedures given below, however, are easily computed and furthermore can be given general intuitive and theoretical justifications.

The procedures and evaluation of their properties are based on the 
following theorem: 

Theorem 3.3.3. If the $m$-component vector $Y$ is distributed according to $N(\nu, T)$ (nonsingular), then $Y'T^{-1}Y$ is distributed according to the noncentral $\chi^2$-distribution with $m$ degrees of freedom and noncentrality parameter $\nu'T^{-1}\nu$. If $\nu = 0$, the distribution is the central $\chi^2$-distribution.

Proof. Let $C$ be a nonsingular matrix such that $CTC' = I$, and define $Z = CY$. Then $Z$ is normally distributed with mean $\mathscr{E}Z = C\mathscr{E}Y = C\nu = \lambda$, say, and covariance matrix $\mathscr{E}(Z - \lambda)(Z - \lambda)' = \mathscr{E}C(Y - \nu)(Y - \nu)'C' = CTC' = I$. Then $Y'T^{-1}Y = Z'(C')^{-1}T^{-1}C^{-1}Z = Z'(CTC')^{-1}Z = Z'Z$, which is the sum of squares of the components of $Z$. Similarly $\nu'T^{-1}\nu = \lambda'\lambda$. Thus $Y'T^{-1}Y$ is distributed as $\sum_{i=1}^{m} Z_i^2$, where $Z_1,\dots,Z_m$ are independently normally distributed with means $\lambda_1,\dots,\lambda_m$, respectively, and variances 1. By definition this distribution is the noncentral $\chi^2$-distribution with noncentrality parameter $\sum_{i=1}^{m} \lambda_i^2$. See Section 3.3.3. If $\lambda_1 = \cdots = \lambda_m = 0$, the distribution is central. (See Problem 7.5.) ■
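The algebraic step $Y'T^{-1}Y = Z'Z$ in the proof can be illustrated for $m = 2$. The proof only requires some nonsingular $C$ with $CTC' = I$; the sketch below assumes the particular choice $C = L^{-1}$, where $T = LL'$ is the Cholesky factorization. The matrix `T` and the vectors `y` and `nu` are illustrative values.

```python
import math

T = [[4.0, 2.0], [2.0, 3.0]]    # illustrative positive definite matrix
nu = [1.0, -1.0]                # illustrative mean vector
y = [1.0, 3.0]                  # one illustrative value of Y

# Cholesky factor L of T (lower triangular, T = L L'), written out for 2 x 2.
l11 = math.sqrt(T[0][0])
l21 = T[1][0] / l11
l22 = math.sqrt(T[1][1] - l21 ** 2)

def whiten(v):
    """Return C v with C = L^{-1}, by forward substitution."""
    z1 = v[0] / l11
    z2 = (v[1] - l21 * z1) / l22
    return [z1, z2]

def quad_form(v):
    """v' T^{-1} v using the explicit 2 x 2 inverse."""
    det = T[0][0] * T[1][1] - T[0][1] * T[1][0]
    return (T[1][1] * v[0] ** 2 - 2 * T[0][1] * v[0] * v[1]
            + T[0][0] * v[1] ** 2) / det

z = whiten(y)
lam = whiten(nu)
print(quad_form(y), sum(c * c for c in z))        # Y'T^{-1}Y and Z'Z
print(quad_form(nu), sum(c * c for c in lam))     # noncentrality, two ways
```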

Since $\sqrt{N}(\bar{X} - \mu)$ is distributed according to $N(0, \Sigma)$, it follows from the
theorem that

(12)    $N(\bar{X} - \mu)'\Sigma^{-1}(\bar{X} - \mu)$

has a (central) $\chi^2$-distribution with $p$ degrees of freedom. This is the
fundamental fact we use in setting up tests and confidence regions concerning
$\mu$.

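Let $\chi_p^2(\alpha)$ be the number such that

(13)    $\Pr\{\chi_p^2 > \chi_p^2(\alpha)\} = \alpha.$

Thus

(14)    $\Pr\{N(\bar{X} - \mu)'\Sigma^{-1}(\bar{X} - \mu) > \chi_p^2(\alpha)\} = \alpha.$

To test the hypothesis that $\mu = \mu_0$, where $\mu_0$ is a specified vector, we use as
our critical region

(15)    $N(\bar{x} - \mu_0)'\Sigma^{-1}(\bar{x} - \mu_0) > \chi_p^2(\alpha).$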

If we obtain a sample such that (15) is satisfied, we reject the null hypothesis.
It can be seen intuitively that the probability is greater than $\alpha$ of rejecting
the hypothesis if $\mu$ is very different from $\mu_0$, since in the space of $\bar{x}$ (15)
defines an ellipsoid with center at $\mu_0$, and when $\mu$ is far from $\mu_0$ the density
of $\bar{x}$ will be concentrated at a point near the edge or outside of the ellipsoid.
The quantity $N(\bar{X} - \mu_0)'\Sigma^{-1}(\bar{X} - \mu_0)$ is distributed as a noncentral $\chi^2$ with
$p$ degrees of freedom and noncentrality parameter $N(\mu - \mu_0)'\Sigma^{-1}(\mu - \mu_0)$
when $\bar{X}$ is the mean of a sample of $N$ from $N(\mu, \Sigma)$ [given by Bose
(1936a), (1936b)]. Pearson (1900) first proved Theorem 3.3.3 for $\nu = 0$.

Now consider the following statement made on the basis of a sample with
mean $\bar{x}$: "The mean of the distribution satisfies

(16)    $N(\bar{x} - \mu^*)'\Sigma^{-1}(\bar{x} - \mu^*) \le \chi_p^2(\alpha)$

as an inequality on $\mu^*$." We see from (14) that the probability that a sample
will be drawn such that the above statement is true is $1 - \alpha$ because the
event in (14) is equivalent to the statement being false. Thus, the set of $\mu^*$
satisfying (16) is a confidence region for $\mu$ with confidence $1 - \alpha$.

In the $p$-dimensional space of $\bar{x}$, (15) is the surface and exterior of an
ellipsoid with center $\mu_0$, the shape of the ellipsoid depending on $\Sigma^{-1}$ and
the size on $(1/N)\chi_p^2(\alpha)$ for given $\Sigma^{-1}$. In the $p$-dimensional space of $\mu^*$,
(16) is the surface and interior of an ellipsoid with its center at $\bar{x}$. If $\Sigma^{-1} = I$,
then (14) says that the probability is $\alpha$ that the distance between $\bar{x}$ and $\mu$ is
greater than $\sqrt{\chi_p^2(\alpha)/N}$.

Theorem 3.3.4. If $\bar{x}$ is the mean of a sample of $N$ drawn from $N(\mu, \Sigma)$ and
$\Sigma$ is known, then (15) gives a critical region of size $\alpha$ for testing the hypothesis
$\mu = \mu_0$, and (16) gives a confidence region for $\mu$ of confidence $1 - \alpha$. Here
$\chi_p^2(\alpha)$ is chosen to satisfy (13).
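The test (15) and the confidence region (16) use the same quadratic form, so one can be read off from the other. The following is a minimal sketch for the bivariate case $p = 2$ only, where the upper $\alpha$ point has the closed form $\chi_2^2(\alpha) = -2\ln\alpha$. The function names are illustrative and are not taken from the text.

```python
import math

def chi2_quantile_2df(alpha):
    """Upper-alpha point of the central chi-square with 2 degrees of freedom.

    For 2 degrees of freedom Pr{chi2 > c} = exp(-c/2), hence c = -2 ln(alpha).
    """
    return -2.0 * math.log(alpha)

def quad_form_inv_2x2(d, sigma):
    """Return d' Sigma^{-1} d for a 2-vector d and a 2x2 covariance matrix."""
    (a, b), (c, e) = sigma
    det = a * e - b * c
    inv = [[e / det, -b / det], [-c / det, a / det]]
    return sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))

def statistic(xbar, mu, sigma, n):
    """N (xbar - mu)' Sigma^{-1} (xbar - mu), the left side of (15) and (16)."""
    d = [xbar[0] - mu[0], xbar[1] - mu[1]]
    return n * quad_form_inv_2x2(d, sigma)

def reject(xbar, mu0, sigma, n, alpha):
    """Critical region (15): reject mu = mu0 when the statistic is too large."""
    return statistic(xbar, mu0, sigma, n) > chi2_quantile_2df(alpha)

def in_confidence_region(mu_star, xbar, sigma, n, alpha):
    """Confidence region (16): mu_star is inside when the statistic is small."""
    return statistic(xbar, mu_star, sigma, n) <= chi2_quantile_2df(alpha)
```

By construction, `reject(xbar, mu0, ...)` is true exactly when `mu0` lies outside the region returned by `in_confidence_region`, which is the duality stated in Theorem 3.3.4.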

The same technique can be used for the corresponding two-sample problems.
Suppose we have a sample $\{x_\alpha^{(1)}\}$, $\alpha = 1, \ldots, N_1$, from the distribution
$N(\mu^{(1)}, \Sigma)$, and a sample $\{x_\alpha^{(2)}\}$, $\alpha = 1, \ldots, N_2$, from a second normal
population $N(\mu^{(2)}, \Sigma)$ with the same covariance matrix. Then the two sample
means

(17)    $\bar{x}^{(1)} = \frac{1}{N_1}\sum_{\alpha=1}^{N_1} x_\alpha^{(1)}, \qquad \bar{x}^{(2)} = \frac{1}{N_2}\sum_{\alpha=1}^{N_2} x_\alpha^{(2)}$

are distributed independently according to $N[\mu^{(1)}, (1/N_1)\Sigma]$ and
$N[\mu^{(2)}, (1/N_2)\Sigma]$, respectively. The difference of the two sample means,
$y = \bar{x}^{(1)} - \bar{x}^{(2)}$, is distributed according to $N\{\nu, [(1/N_1) + (1/N_2)]\Sigma\}$, where
$\nu = \mu^{(1)} - \mu^{(2)}$. Thus

(18)    $\frac{N_1N_2}{N_1 + N_2}(y - \nu)'\Sigma^{-1}(y - \nu) \le \chi_p^2(\alpha)$

is a confidence region for the difference $\nu$ of the two mean vectors, and a
critical region for testing the hypothesis $\mu^{(1)} = \mu^{(2)}$ is given by

(19)    $\frac{N_1N_2}{N_1 + N_2}(\bar{x}^{(1)} - \bar{x}^{(2)})'\Sigma^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)}) > \chi_p^2(\alpha).$

Mahalanobis (1930) suggested $(\mu^{(1)} - \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})$ as a measure of
the distance squared between two populations. Let $C$ be a matrix such that
$\Sigma = CC'$ and let $\nu^{(i)} = C^{-1}\mu^{(i)}$, $i = 1, 2$. Then the distance squared is
$(\nu^{(1)} - \nu^{(2)})'(\nu^{(1)} - \nu^{(2)})$, which is the Euclidean distance squared.
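The equivalence between the Mahalanobis distance and a Euclidean distance after the transformation $C^{-1}$ can be checked numerically. The sketch below assumes $p = 2$ and takes $C$ to be the lower-triangular Cholesky factor, which is one admissible choice of a matrix with $\Sigma = CC'$. The helper names are illustrative.

```python
import math

def cholesky_2x2(sigma):
    """Lower-triangular C with Sigma = C C' for a 2x2 positive definite Sigma."""
    (a, b), (_, d) = sigma
    c11 = math.sqrt(a)
    c21 = b / c11
    c22 = math.sqrt(d - c21 * c21)
    return [[c11, 0.0], [c21, c22]]

def solve_lower_2x2(c, v):
    """Solve C z = v for lower-triangular C, that is, z = C^{-1} v."""
    z1 = v[0] / c[0][0]
    z2 = (v[1] - c[1][0] * z1) / c[1][1]
    return [z1, z2]

def mahalanobis_sq(mu1, mu2, sigma):
    """(mu1 - mu2)' Sigma^{-1} (mu1 - mu2) using the explicit 2x2 inverse."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    u, w = mu1[0] - mu2[0], mu1[1] - mu2[1]
    return (d * u * u - (b + c) * u * w + a * w * w) / det

def euclidean_sq_after_transform(mu1, mu2, sigma):
    """Squared Euclidean distance between C^{-1} mu1 and C^{-1} mu2."""
    c = cholesky_2x2(sigma)
    n1 = solve_lower_2x2(c, mu1)
    n2 = solve_lower_2x2(c, mu2)
    return (n1[0] - n2[0]) ** 2 + (n1[1] - n2[1]) ** 2
```

The two functions agree up to rounding error for any positive definite `sigma`, since $(C^{-1})'C^{-1} = \Sigma^{-1}$.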

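3.3.3. The Noncentral $\chi^2$-Distribution; the Power Function

The power function of the test (15) of the null hypothesis that $\mu = \mu_0$ can be
evaluated from the noncentral $\chi^2$-distribution. The central $\chi^2$-distribution is
the distribution of the sum of squares of independent (scalar) normal
variables with means 0 and variances 1; the noncentral $\chi^2$-distribution is the
generalization of this when the means may be different from 0. Let $Y$ (of $p$
components) be distributed according to $N(\lambda, I)$. Let $Q$ be an orthogonal
matrix with elements of the first row being

(20)    $q_{1i} = \frac{\lambda_i}{\sqrt{\lambda'\lambda}}, \qquad i = 1, \ldots, p.$

Then $Z = QY$ is distributed according to $N(\boldsymbol{\tau}, I)$, where

(21)    $\boldsymbol{\tau} = (\tau, 0, \ldots, 0)'$

and $\tau = \sqrt{\lambda'\lambda}$. Let $V = Y'Y = Z'Z = \sum_{i=1}^p Z_i^2$. Then $W = \sum_{i=2}^p Z_i^2$ has a
$\chi^2$-distribution with $p - 1$ degrees of freedom (Problem 7.5), and $Z_1$ and $W$
have as joint density

(22)    $\frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(z_1 - \tau)^2} \cdot \frac{1}{2^{\frac{1}{2}(p-1)}\Gamma[\frac{1}{2}(p-1)]} w^{\frac{1}{2}(p-1)-1} e^{-\frac{1}{2}w}$

        $= C e^{-\frac{1}{2}(\tau^2 + z_1^2 + w)} w^{\frac{1}{2}(p-3)} e^{\tau z_1}$

        $= C e^{-\frac{1}{2}(\tau^2 + z_1^2 + w)} w^{\frac{1}{2}(p-3)} \sum_{\alpha=0}^{\infty} \frac{\tau^\alpha z_1^\alpha}{\alpha!},$

where $C^{-1} = 2^{\frac{1}{2}p}\sqrt{\pi}\,\Gamma[\frac{1}{2}(p-1)]$. The joint density of $V = W + Z_1^2$ and $Z_1$ is
obtained by substituting $w = v - z_1^2$ (the Jacobian being 1):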


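(23)    $C e^{-\frac{1}{2}(\tau^2 + v)} (v - z_1^2)^{\frac{1}{2}(p-3)} \sum_{\alpha=0}^{\infty} \frac{\tau^\alpha z_1^\alpha}{\alpha!}.$

The joint density of $V$ and $U = Z_1/\sqrt{V}$ is ($dz_1 = \sqrt{v}\,du$)

(24)    $C e^{-\frac{1}{2}(\tau^2 + v)} v^{\frac{1}{2}(p-2)} (1 - u^2)^{\frac{1}{2}(p-3)} \sum_{\alpha=0}^{\infty} \frac{\tau^\alpha v^{\frac{1}{2}\alpha} u^\alpha}{\alpha!}.$

The admissible range of $z_1$ given $v$ is $-\sqrt{v}$ to $\sqrt{v}$, and the admissible range
of $u$ is $-1$ to 1. When we integrate (24) with respect to $u$ term by term, the
terms for $\alpha$ odd integrate to 0, since such a term is an odd function of $u$. In
the other integrations we substitute $u = \sqrt{s}$ ($du = \frac{1}{2}\,ds/\sqrt{s}$) to obtain

(25)    $\int_{-1}^{1} (1 - u^2)^{\frac{1}{2}(p-3)} u^{2\beta}\,du = 2\int_0^1 (1 - u^2)^{\frac{1}{2}(p-3)} u^{2\beta}\,du$

        $= \int_0^1 (1 - s)^{\frac{1}{2}(p-3)} s^{\beta - \frac{1}{2}}\,ds$

        $= \frac{\Gamma[\frac{1}{2}(p-1)]\,\Gamma(\beta + \frac{1}{2})}{\Gamma(\frac{1}{2}p + \beta)},$

by the usual properties of the beta and gamma functions. Thus the density of
$V$ is

(26)    $\frac{1}{2^{\frac{1}{2}p}\sqrt{\pi}}\, e^{-\frac{1}{2}(\tau^2 + v)} v^{\frac{1}{2}p - 1} \sum_{\beta=0}^{\infty} \frac{(\tau^2)^\beta v^\beta}{(2\beta)!} \cdot \frac{\Gamma(\beta + \frac{1}{2})}{\Gamma(\frac{1}{2}p + \beta)}.$

We can use the duplication formula for the gamma function, $\Gamma(2\beta + 1) = (2\beta)!$
(Problem 7.37),

(27)    $\Gamma(2\beta + 1) = \Gamma(\beta + \tfrac{1}{2})\,\Gamma(\beta + 1)\, 2^{2\beta}/\sqrt{\pi},$

to rewrite (26) as

(28)    $\tfrac{1}{2}\, e^{-\frac{1}{2}(\tau^2 + v)} \left(\frac{v}{2}\right)^{\frac{1}{2}p - 1} \sum_{\beta=0}^{\infty} \left(\frac{\tau^2}{4}\right)^{\beta} \frac{v^\beta}{\beta!\,\Gamma(\frac{1}{2}p + \beta)}.$

This is the density of the noncentral $\chi^2$-distribution with $p$ degrees of freedom
and noncentrality parameter $\tau^2$.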


Theorem 3.3.5. If $Y$ of $p$ components is distributed according to $N(\lambda, I)$,
then $V = Y'Y$ has the density (28), where $\tau^2 = \lambda'\lambda$.

To obtain the power function of the test (15), we note that $\sqrt{N}(\bar{X} - \mu_0)$
has the distribution $N[\sqrt{N}(\mu - \mu_0), \Sigma]$. From Theorem 3.3.3 we obtain the
following corollary:

Corollary 3.3.1. If $\bar{X}$ is the mean of a random sample of $N$ drawn from
$N(\mu, \Sigma)$, then $N(\bar{X} - \mu_0)'\Sigma^{-1}(\bar{X} - \mu_0)$ has a noncentral $\chi^2$-distribution with $p$
degrees of freedom and noncentrality parameter $N(\mu - \mu_0)'\Sigma^{-1}(\mu - \mu_0)$.
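Corollary 3.3.1 lets the power of the test (15) be computed from the noncentral $\chi^2$-distribution. The sketch below is one way to do this under stated assumptions, not a general routine: it regroups the series (28) as a Poisson mixture (weights $e^{-\tau^2/2}(\tau^2/2)^\beta/\beta!$) of central $\chi^2$ densities with $p + 2\beta$ degrees of freedom, and it is restricted to $p = 2$ so that both the critical value and the central tail probabilities have closed forms. The function names and the truncation at `terms` are illustrative choices.

```python
import math

def chi2_sf_even(x, df):
    """Pr{central chi-square with df degrees of freedom > x} for even df.

    For df = 2k this equals exp(-x/2) * sum_{j<k} (x/2)^j / j!.
    """
    k = df // 2
    term, total = 1.0, 0.0
    for j in range(k):
        total += term
        term *= (x / 2.0) / (j + 1)
    return math.exp(-x / 2.0) * total

def noncentral_chi2_sf_even(x, df, tau2, terms=200):
    """Upper tail of the noncentral chi-square (even df, noncentrality tau2).

    Truncated Poisson(tau2/2) mixture; exp(-tau2/2) underflows for very
    large tau2, so this sketch is meant for moderate noncentrality only.
    """
    lam = tau2 / 2.0
    weight = math.exp(-lam)
    total = 0.0
    for beta in range(terms):
        total += weight * chi2_sf_even(x, df + 2 * beta)
        weight *= lam / (beta + 1)
    return total

def power_p2(n, delta2, alpha):
    """Power of test (15) for p = 2.

    delta2 stands for (mu - mu0)' Sigma^{-1} (mu - mu0), so the
    noncentrality parameter of Corollary 3.3.1 is n * delta2.
    """
    critical = -2.0 * math.log(alpha)  # chi-square upper point, 2 d.f.
    return noncentral_chi2_sf_even(critical, 2, n * delta2)
```

At `delta2 = 0` the function returns `alpha`, the size of the test, and the power increases with `n * delta2`.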

3.4. THEORETICAL PROPERTIES OF ESTIMATORS
OF THE MEAN VECTOR

3.4.1. Properties of Maximum Likelihood Estimators

It was shown in Section 3.3.1 that $\bar{x}$ and $S$ are unbiased estimators of $\mu$ and
$\Sigma$, respectively. In this subsection we shall show that $\bar{x}$ and $S$ are sufficient
statistics and are complete.

Sufficiency 

A statistic $T$ is sufficient for a family of distributions of $X$ or for a parameter
$\theta$ if the conditional distribution of $X$ given $T = t$ does not depend on $\theta$ [e.g.,
Cramér (1946), Section 32.4]. In this sense the statistic $T$ gives as much
information about $\theta$ as the entire sample $X$. (Of course, this idea depends
strictly on the assumed family of distributions.)

Factorization Theorem. A statistic $t(y)$ is sufficient for $\theta$ if and only if the
density $f(y|\theta)$ can be factored as

(1)    $f(y|\theta) = g[t(y), \theta]\,h(y),$

where $g[t(y), \theta]$ and $h(y)$ are nonnegative and $h(y)$ does not depend on $\theta$.

Theorem 3.4.1. If $x_1, \ldots, x_N$ are observations from $N(\mu, \Sigma)$, then $\bar{x}$ and $S$
are sufficient for $\mu$ and $\Sigma$. If $\mu$ is given, $\sum_{\alpha=1}^N (x_\alpha - \mu)(x_\alpha - \mu)'$ is sufficient for
$\Sigma$. If $\Sigma$ is given, $\bar{x}$ is sufficient for $\mu$.

Proof. The density of $X_1, \ldots, X_N$ is

(2)    $\prod_{\alpha=1}^N n(x_\alpha | \mu, \Sigma)$

        $= (2\pi)^{-\frac{1}{2}Np} |\Sigma|^{-\frac{1}{2}N} \exp\Bigl[-\tfrac{1}{2}\,\mathrm{tr}\,\Sigma^{-1}\sum_{\alpha=1}^N (x_\alpha - \mu)(x_\alpha - \mu)'\Bigr]$

        $= (2\pi)^{-\frac{1}{2}Np} |\Sigma|^{-\frac{1}{2}N} \exp\bigl\{-\tfrac{1}{2}\bigl[N(\bar{x} - \mu)'\Sigma^{-1}(\bar{x} - \mu) + (N - 1)\,\mathrm{tr}\,\Sigma^{-1}S\bigr]\bigr\}.$

The right-hand side of (2) is in the form of (1) for $\bar{x}$, $S$, $\mu$, $\Sigma$, and the middle
is in the form of (1) for $\sum_{\alpha=1}^N (x_\alpha - \mu)(x_\alpha - \mu)'$, $\Sigma$; in each case $h(x_1, \ldots, x_N) = 1$.
The right-hand side is in the form of (1) for $\bar{x}$, $\mu$ with $h(x_1, \ldots, x_N) =
\exp\{-\tfrac{1}{2}(N - 1)\,\mathrm{tr}\,\Sigma^{-1}S\}$. ■

Note that if $\Sigma$ is given, $\bar{x}$ is sufficient for $\mu$, but if $\mu$ is given, $S$ is not
sufficient for $\Sigma$.

Completeness

To prove an optimality property of the $T^2$-test (Section 5.5), we need the
result that $(\bar{x}, S)$ is a complete sufficient set of statistics for $(\mu, \Sigma)$.

Definition 3.4.1. A family of distributions of $y$ indexed by $\theta$ is complete if
for every real-valued function $g(y)$,

(3)    $\mathcal{E}_\theta\, g(y) \equiv 0$

identically in $\theta$ implies $g(y) = 0$ except for a set of $y$ of probability 0 for every $\theta$.

If the family of distributions of a sufficient set of statistics is complete, the
set is called a complete sufficient set.

Theorem 3.4.2. The sufficient set of statistics $\bar{x}$, $S$ is complete for $\mu$, $\Sigma$ when
the sample is drawn from $N(\mu, \Sigma)$.


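Proof. We can define the sample in terms of $\bar{x}$ and $z_1, \ldots, z_n$ as in Section
3.3 with $n = N - 1$. We assume for any function $g(\bar{x}, A) = g(\bar{x}, nS)$ that

(4)    $\int \cdots \int K|\Sigma|^{-\frac{1}{2}N}\, g\Bigl(\bar{x}, \sum_{\alpha=1}^n z_\alpha z_\alpha'\Bigr)$
        $\cdot \exp\Bigl\{-\tfrac{1}{2}\Bigl[N(\bar{x} - \mu)'\Sigma^{-1}(\bar{x} - \mu) + \sum_{\alpha=1}^n z_\alpha'\Sigma^{-1}z_\alpha\Bigr]\Bigr\}$
        $\cdot\, d\bar{x} \prod_{\alpha=1}^n dz_\alpha \equiv 0, \qquad \forall \mu, \Sigma,$

where $K$ is a normalizing constant depending only on $N$ and $p$, $d\bar{x} = \prod_{i=1}^p d\bar{x}_i$,
and $dz_\alpha = \prod_{i=1}^p dz_{i\alpha}$. If we let $\Sigma^{-1} = I - 2\Theta$, where $\Theta = \Theta'$ and $I - 2\Theta$ is
positive definite, and let $\mu = (I - 2\Theta)^{-1}t$, then (4) is

(5)    $0 \equiv \int \cdots \int K|I - 2\Theta|^{\frac{1}{2}N}\, g\Bigl(\bar{x}, \sum_{\alpha=1}^n z_\alpha z_\alpha'\Bigr)$
        $\cdot \exp\Bigl\{-\tfrac{1}{2}\Bigl[\mathrm{tr}\,(I - 2\Theta)\Bigl(\sum_{\alpha=1}^n z_\alpha z_\alpha' + N\bar{x}\bar{x}'\Bigr) - 2Nt'\bar{x} + Nt'(I - 2\Theta)^{-1}t\Bigr]\Bigr\}\, d\bar{x} \prod_{\alpha=1}^n dz_\alpha$

        $= |I - 2\Theta|^{\frac{1}{2}N} \exp\bigl\{-\tfrac{1}{2}Nt'(I - 2\Theta)^{-1}t\bigr\} \int \cdots \int g(\bar{x}, B - N\bar{x}\bar{x}')$
        $\cdot \exp[\mathrm{tr}\,\Theta B + t'(N\bar{x})]\; n[\bar{x}\,|\,0, (1/N)I] \prod_{\alpha=1}^n n(z_\alpha|0, I)\, d\bar{x} \prod_{\alpha=1}^n dz_\alpha,$

where $B = \sum_{\alpha=1}^n z_\alpha z_\alpha' + N\bar{x}\bar{x}'$. Thus

(6)    $0 \equiv \mathcal{E}\, g(\bar{x}, B - N\bar{x}\bar{x}') \exp[\mathrm{tr}\,\Theta B + t'(N\bar{x})]$

        $= \int\!\!\int g(\bar{x}, B - N\bar{x}\bar{x}') \exp[\mathrm{tr}\,\Theta B + t'(N\bar{x})]\, h(\bar{x}, B)\, d\bar{x}\, dB,$

where $h(\bar{x}, B)$ is the joint density of $\bar{x}$ and $B$ and $dB = \prod_{i \le j} db_{ij}$. The
right-hand side of (6) is the Laplace transform of $g(\bar{x}, B - N\bar{x}\bar{x}')\,h(\bar{x}, B)$.
Since this is 0, $g(\bar{x}, A) = 0$ except for a set of measure 0. ■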

Efficiency 

If a $q$-component random vector $Y$ has mean vector $\mathcal{E}Y = \nu$ and covariance
matrix $\mathcal{E}(Y - \nu)(Y - \nu)' = \Psi$, then

(7)    $(y - \nu)'\Psi^{-1}(y - \nu) = q + 2$

is called the concentration ellipsoid of $Y$. [See Cramér (1946), p. 300.] The
density defined by a uniform distribution over the interior of this ellipsoid
has the same mean vector and covariance matrix as $Y$. (See Problem 2.14.)
Let $\theta$ be a vector of $q$ parameters in a distribution, and let $t$ be a vector of
unbiased estimators (that is, $\mathcal{E}t = \theta$) based on $N$ observations from that
distribution with covariance matrix $\Psi$. Then the ellipsoid

(8)    $N(t - \theta)'\,\mathcal{E}\Bigl(\frac{\partial \log f}{\partial \theta}\Bigr)\Bigl(\frac{\partial \log f}{\partial \theta}\Bigr)'(t - \theta) = q + 2$

lies entirely within the ellipsoid of concentration of $t$; $\partial \log f/\partial \theta$ denotes the
column vector of derivatives of the logarithm of the density of the distribution
(or probability function) with respect to the components of $\theta$. The discussion by Cramér
(1946, p. 495) is in terms of scalar observations, but it is clear that it holds
true for vector observations. If (8) is the ellipsoid of concentration of $t$, then
$t$ is said to be efficient. In general, the ratio of the volume of (8) to that of
the ellipsoid of concentration defines the efficiency of $t$. In the case of the
multivariate normal distribution, if $\theta = \mu$, then $\bar{x}$ is efficient. If $\theta$ includes
both $\mu$ and $\Sigma$, then $\bar{x}$ and $S$ have efficiency $[(N - 1)/N]^{p(p+1)/2}$. Under
suitable regularity conditions, which are satisfied by the multivariate normal
distribution,

(9)    $\mathcal{E}\Bigl(\frac{\partial \log f}{\partial \theta}\Bigr)\Bigl(\frac{\partial \log f}{\partial \theta}\Bigr)' = -\mathcal{E}\,\frac{\partial^2 \log f}{\partial \theta\, \partial \theta'}.$

This is the information matrix for one observation. The Cramér-Rao lower
bound is that for any unbiased estimator $t$ the matrix

(10)    $N\mathcal{E}(t - \theta)(t - \theta)' - \Bigl[-\mathcal{E}\,\frac{\partial^2 \log f}{\partial \theta\, \partial \theta'}\Bigr]^{-1}$

is positive semidefinite. (Other lower bounds can also be given.)

Consistency 

Definition 3.4.2. A sequence of vectors $t_n = (t_{1n}, \ldots, t_{mn})'$, $n = 1, 2, \ldots$, is
a consistent estimator of $\theta = (\theta_1, \ldots, \theta_m)'$ if $\operatorname{plim}_{n\to\infty} t_{in} = \theta_i$, $i = 1, \ldots, m$.

By the law of large numbers each component of the sample mean $\bar{x}$ is a
consistent estimator of that component of the vector of expected values $\mu$ if
the observation vectors are independently and identically distributed with
mean $\mu$, and hence $\bar{x}$ is a consistent estimator of $\mu$. Normality is not
involved.

An element of the sample covariance matrix is

(11)    $s_{ij} = \frac{1}{N - 1}\sum_{\alpha=1}^N (x_{i\alpha} - \mu_i)(x_{j\alpha} - \mu_j) - \frac{N}{N - 1}(\bar{x}_i - \mu_i)(\bar{x}_j - \mu_j)$

by Lemma 3.2.1 with $b = \mu$. The probability limit of the second term is 0.
The probability limit of the first term is $\sigma_{ij}$ if $x_1, x_2, \ldots$ are independently
and identically distributed with mean $\mu$ and covariance matrix $\Sigma$. Then $S$ is
a consistent estimator of $\Sigma$.

Asymptotic Normality 

First we prove a multivariate central limit theorem. 

Theorem 3.4.3. Let the m-component vectors $Y_1, Y_2, \ldots$ be independently
and identically distributed with means $\mathcal{E}Y_\alpha = \nu$ and covariance matrices
$\mathcal{E}(Y_\alpha - \nu)(Y_\alpha - \nu)' = T$. Then the limiting distribution of $(1/\sqrt{n})\sum_{\alpha=1}^n (Y_\alpha - \nu)$
as $n \to \infty$ is $N(0, T)$.

Proof. Let

(12)    $\phi_n(t, u) = \mathcal{E}\exp\Bigl[iu\,\frac{1}{\sqrt{n}}\sum_{\alpha=1}^n t'(Y_\alpha - \nu)\Bigr],$

where $u$ is a scalar and $t$ an m-component vector. For fixed $t$, $\phi_n(t, u)$ can be
considered as the characteristic function of $(1/\sqrt{n})\sum_{\alpha=1}^n (t'Y_\alpha - \mathcal{E}t'Y_\alpha)$. By
the univariate central limit theorem [Cramér (1946), p. 215], the limiting
distribution is $N(0, t'Tt)$. Therefore (Theorem 2.6.4),

(13)    $\lim_{n\to\infty} \phi_n(t, u) = e^{-\frac{1}{2}u^2 t'Tt}$

for every $u$ and $t$. (For $t = 0$ a special and obvious argument is used.) Let
$u = 1$ to obtain

(14)    $\lim_{n\to\infty} \mathcal{E}\exp\Bigl[it'\,\frac{1}{\sqrt{n}}\sum_{\alpha=1}^n (Y_\alpha - \nu)\Bigr] = e^{-\frac{1}{2}t'Tt}$

for every $t$. Since $e^{-\frac{1}{2}t'Tt}$ is continuous at $t = 0$, the convergence is uniform in
some neighborhood of $t = 0$. The theorem follows. ■


Now we wish to show that the sample covariance matrix is asymptotically 
normally distributed as the sample size increases. 

Theorem 3.4.4. Let $A(n) = \sum_{\alpha=1}^N (X_\alpha - \bar{X}_N)(X_\alpha - \bar{X}_N)'$, where $X_1, X_2, \ldots$
are independently distributed according to $N(\mu, \Sigma)$ and $n = N - 1$. Then the
limiting distribution of $B(n) = (1/\sqrt{n})[A(n) - n\Sigma]$ is normal with mean 0 and
covariances

(15)    $\mathcal{E}\,b_{ij}(n)\,b_{kl}(n) = \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}.$

Proof. As shown earlier, $A(n)$ is distributed as $A(n) = \sum_{\alpha=1}^n Z_\alpha Z_\alpha'$, where
$Z_1, Z_2, \ldots$ are distributed independently according to $N(0, \Sigma)$. We arrange
the elements of $Z_\alpha Z_\alpha'$ in a vector such as

(16)    $Y_\alpha = (Z_{1\alpha}^2,\; Z_{1\alpha}Z_{2\alpha},\; \ldots,\; Z_{p\alpha}^2)';$

the moments of $Y_\alpha$ can be deduced from the moments of $Z_\alpha$ as given in
Section 2.6. We have $\mathcal{E}Z_{i\alpha}Z_{j\alpha} = \sigma_{ij}$, $\mathcal{E}Z_{i\alpha}Z_{j\alpha}Z_{k\alpha}Z_{l\alpha} = \sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}$,
$\mathcal{E}(Z_{i\alpha}Z_{j\alpha} - \sigma_{ij})(Z_{k\alpha}Z_{l\alpha} - \sigma_{kl}) = \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}$. Thus the vectors $Y_\alpha$
defined by (16) satisfy the conditions of Theorem 3.4.3 with the elements
of $\nu$ being the elements of $\Sigma$ arranged in vector form similar to (16)
and the elements of $T$ being given above. If the elements of $A(n)$ are
arranged in vector form similar to (16), say the vector $W(n)$, then $W(n) - n\nu
= \sum_{\alpha=1}^n (Y_\alpha - \nu)$. By Theorem 3.4.3, $(1/\sqrt{n})[W(n) - n\nu]$ has a limiting normal
distribution with mean 0 and the covariance matrix of $Y_\alpha$. ■

The elements of $B(n)$ will have a limiting normal distribution with mean 0
if $x_1, x_2, \ldots$ are independently and identically distributed with finite
fourth-order moments, but the covariance structure of $B(n)$ will depend on the
fourth-order moments.

3.4.2. Decision Theory 

It may be enlightening to consider estimation in terms of decision theory. We
review some of the concepts. An observation $x$ is made on a random variable
$X$ (which may be a vector) whose distribution $P_\theta$ depends on a parameter $\theta$
which is an element of a set $\Theta$. The statistician is to make a decision $d$ in a
set $D$. A decision procedure is a function $\delta(x)$ whose domain is the set of
values of $X$ and whose range is $D$. The loss in making decision $d$ when the
distribution is $P_\theta$ is a nonnegative function $L(\theta, d)$. The evaluation of a
procedure $\delta(x)$ is on the basis of the risk function

(17)    $R(\theta, \delta) = \mathcal{E}_\theta L[\theta, \delta(X)].$

For example, if $d$ and $\theta$ are univariate, the loss may be squared error,
$L(\theta, d) = (\theta - d)^2$, and the risk is the mean squared error $\mathcal{E}_\theta[\delta(X) - \theta]^2$.

A decision procedure $\delta(x)$ is as good as a procedure $\delta^*(x)$ if

(18)    $R(\theta, \delta) \le R(\theta, \delta^*), \qquad \forall \theta;$

$\delta(x)$ is better than $\delta^*(x)$ if (18) holds with a strict inequality for at least one
value of $\theta$. A procedure $\delta^*(x)$ is inadmissible if there exists another procedure
$\delta(x)$ that is better than $\delta^*(x)$. A procedure is admissible if it is not
inadmissible (i.e., if there is no procedure better than it) in terms of the given
loss function. A class of procedures is complete if for any procedure not in
the class there is a better procedure in the class. The class is minimal
complete if it does not contain a proper complete subclass. If a minimal
complete class exists, it is identical to the class of admissible procedures.
When such a class is available, there is no (mathematical) need to use a
procedure outside the minimal complete class. Sometimes it is convenient to
refer to an essentially complete class, which is a class of procedures such that
for every procedure outside the class there is one in the class that is just as
good.

For a given procedure the risk function is a function of the parameter. If
the parameter can be assigned an a priori distribution, say, with density $\rho(\theta)$,
then the average loss from use of a decision procedure $\delta(x)$ is

(19)    $r(\rho, \delta) = \mathcal{E}_\rho R(\theta, \delta) = \mathcal{E}_\rho \mathcal{E}_\theta L[\theta, \delta(X)].$

Given the a priori density $\rho$, the decision procedure $\delta(x)$ that minimizes
$r(\rho, \delta)$ is the Bayes procedure, and the resulting minimum of $r(\rho, \delta)$ is the
Bayes risk. Under general conditions Bayes procedures are admissible and
admissible procedures are Bayes or limits of Bayes procedures. If the density
of $X$ given $\theta$ is $f(x|\theta)$, the joint density of $X$ and $\theta$ is $f(x|\theta)\rho(\theta)$ and the
average risk of a procedure $\delta(x)$ is

(20)    $r(\rho, \delta) = \int_\Theta \int_X L[\theta, \delta(x)]\, f(x|\theta)\rho(\theta)\, dx\, d\theta$

        $= \int_X \Bigl\{\int_\Theta L[\theta, \delta(x)]\, g(\theta|x)\, d\theta\Bigr\} f(x)\, dx;$

here

(21)    $f(x) = \int_\Theta f(x|\theta)\rho(\theta)\, d\theta, \qquad g(\theta|x) = \frac{f(x|\theta)\rho(\theta)}{f(x)}$

are the marginal density of $X$ and the a posteriori density of $\theta$ given $x$. The
procedure that minimizes $r(\rho, \delta)$ is one that for each $x$ minimizes the
expression in braces on the right-hand side of (20), that is, the expectation of
$L[\theta, \delta(x)]$ with respect to the a posteriori distribution. If $\theta$ and $d$ are vectors
and $L(\theta, d) = (\theta - d)'Q(\theta - d)$, where $Q$ is positive definite, then

(22)    $\mathcal{E}_{\theta|x} L[\theta, d(x)] = \mathcal{E}_{\theta|x}[\theta - \mathcal{E}(\theta|x)]'Q[\theta - \mathcal{E}(\theta|x)]$

        $+\, [\mathcal{E}(\theta|x) - d(x)]'Q[\mathcal{E}(\theta|x) - d(x)].$

The minimum occurs at $d(x) = \mathcal{E}(\theta|x)$, the mean of the a posteriori
distribution.

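Theorem 3.4.5. If $x_1, \ldots, x_N$ are independently distributed, each $x_\alpha$ according
to $N(\mu, \Sigma)$, and if $\mu$ has an a priori distribution $N(\nu, \Phi)$, then the a
posteriori distribution of $\mu$ given $x_1, \ldots, x_N$ is normal with mean

(23)    $\Phi\Bigl(\Phi + \frac{1}{N}\Sigma\Bigr)^{-1}\bar{x} + \frac{1}{N}\Sigma\Bigl(\Phi + \frac{1}{N}\Sigma\Bigr)^{-1}\nu$

and covariance matrix

(24)    $\Phi - \Phi\Bigl(\Phi + \frac{1}{N}\Sigma\Bigr)^{-1}\Phi.$

Proof. Since $\bar{x}$ is sufficient for $\mu$, we need only consider $\bar{x}$, which has the
distribution of $\mu + v$, where $v$ has the distribution $N[0, (1/N)\Sigma]$ and is
independent of $\mu$. Then the joint distribution of $\mu$ and $\bar{x}$ is

(25)    $N\left[\begin{pmatrix} \nu \\ \nu \end{pmatrix}, \begin{pmatrix} \Phi & \Phi \\ \Phi & \Phi + \frac{1}{N}\Sigma \end{pmatrix}\right].$

The mean of the conditional distribution of $\mu$ given $\bar{x}$ is (by Theorem 2.5.1)

(26)    $\nu + \Phi\Bigl(\Phi + \frac{1}{N}\Sigma\Bigr)^{-1}(\bar{x} - \nu),$

which reduces to (23). ■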

Corollary 3.4.1. If $x_1, \ldots, x_N$ are independently distributed, each $x_\alpha$
according to $N(\mu, \Sigma)$, $\mu$ has an a priori distribution $N(\nu, \Phi)$, and the loss function
is $(d - \mu)'Q(d - \mu)$, then the Bayes estimator of $\mu$ is (23).

The Bayes estimator of $\mu$ is a kind of weighted average of $\bar{x}$ and $\nu$, the
prior mean of $\mu$. If $(1/N)\Sigma$ is small compared to $\Phi$ (e.g., if $N$ is large), $\nu$ is
given little weight. Put another way, if $\Phi$ is large, that is, the prior is
relatively uninformative, a large weight is put on $\bar{x}$. In fact, as $\Phi$ tends to $\infty$
in the sense that $\Phi^{-1} \to 0$, the estimator approaches $\bar{x}$.
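The weighting described above is easiest to see in the scalar case. The sketch below specializes (23) and (24) to $p = 1$, writing `sigma2` for the known variance and `phi` for the prior variance. The function name and argument names are illustrative.

```python
def posterior_mean_var(xbar, n, sigma2, nu, phi):
    """Scalar (p = 1) version of the posterior mean (23) and variance (24).

    xbar: sample mean, n: sample size, sigma2: known variance of one
    observation, nu: prior mean, phi: prior variance.
    """
    m = phi + sigma2 / n                      # Phi + (1/N) Sigma
    weight_xbar = phi / m                     # weight on the sample mean
    weight_nu = (sigma2 / n) / m              # weight on the prior mean
    mean = weight_xbar * xbar + weight_nu * nu
    var = phi - phi * phi / m
    return mean, var
```

For instance, with `xbar = 2`, `n = 4`, `sigma2 = 4`, `nu = 0`, and `phi = 1`, the two weights are both 1/2, so the posterior mean is 1 and the posterior variance is 1/2. Letting `phi` grow moves the posterior mean toward `xbar`.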

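A decision procedure $\delta_0(x)$ is minimax if

(27)    $\sup_\theta R(\theta, \delta_0) = \inf_\delta \sup_\theta R(\theta, \delta).$

Theorem 3.4.6. If $x_1, \ldots, x_N$ are independently distributed each according
to $N(\mu, \Sigma)$ and the loss function is $(d - \mu)'Q(d - \mu)$, then $\bar{x}$ is a minimax
estimator.

Proof. This follows from a theorem in statistical decision theory that if a
procedure $\delta_0$ is extended Bayes [i.e., if for arbitrary $\varepsilon$, $r(\rho, \delta_0) \le r(\rho, \delta_\rho) + \varepsilon$
for suitable $\rho$, where $\delta_\rho$ is the corresponding Bayes procedure] and if
$R(\theta, \delta_0)$ is constant, then $\delta_0$ is minimax. [See, e.g., Ferguson (1967),
Theorem 3 of Section 2.11.] We find

(28)    $R(\mu, \bar{x}) = \mathcal{E}(\bar{x} - \mu)'Q(\bar{x} - \mu)$

        $= \mathcal{E}\,\mathrm{tr}\,Q(\bar{x} - \mu)(\bar{x} - \mu)' = \frac{1}{N}\,\mathrm{tr}\,Q\Sigma.$

Let (23) be $d(\bar{x})$. Its average risk is

(29)    $\mathcal{E}\,\mathrm{tr}\,Q[d(\bar{x}) - \mu][d(\bar{x}) - \mu]' = \mathrm{tr}\,Q\Bigl[\Phi - \Phi\Bigl(\Phi + \frac{1}{N}\Sigma\Bigr)^{-1}\Phi\Bigr]$

        $= \frac{1}{N}\,\mathrm{tr}\,Q\Bigl[I + \frac{1}{N}\Sigma\Phi^{-1}\Bigr]^{-1}\Sigma \;\to\; \frac{1}{N}\,\mathrm{tr}\,Q\Sigma$

as $\Phi^{-1} \to 0$. ■

For more discussion of decision theory see Ferguson (1967), DeGroot
(1970), or Berger (1980b).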


3.5. IMPROVED ESTIMATION OF THE MEAN 


3.5.1. Introduction

The sample mean $\bar{x}$ seems the natural estimator of the population mean $\mu$
based on a sample from $N(\mu, \Sigma)$. It is the maximum likelihood estimator, a
sufficient statistic when $\Sigma$ is known, and the minimum variance unbiased
estimator. Moreover, it is equivariant in the sense that if an arbitrary vector $v$
is added to each observation vector and to $\mu$, the error of estimation
$(\bar{x} + v) - (\mu + v) = \bar{x} - \mu$ is independent of $v$; in other words, the error
does not depend on the choice of origin. However, Stein (1956b) showed the
startling fact that this conventional estimator is not admissible with respect to
the loss function that is the sum of mean squared errors of the components
when $\Sigma = I$ and $p \ge 3$. James and Stein (1961) produced an estimator which
has a smaller sum of mean squared errors; this estimator will be studied in
Section 3.5.2. Subsequent studies have shown that the phenomenon is
widespread and the implications imperative.

3.5.2. The James-Stein Estimator

The loss function

(1)    $L(\mu, m) = (m - \mu)'(m - \mu) = \sum_{i=1}^p (m_i - \mu_i)^2 = \|m - \mu\|^2$

is the sum of mean squared errors of the components of the estimator. We
shall show [James and Stein (1961)] that the sample mean is inadmissible by
displaying an alternative estimator that has a smaller expected loss for every
mean vector $\mu$. We assume that the normal distribution sampled has covariance
matrix proportional to $I$ with the constant of proportionality known. It
will be convenient to take this constant to be such that $Y = (1/N)\sum_{\alpha=1}^N X_\alpha = \bar{X}$
has the distribution $N(\mu, I)$. Then the expected loss or risk of the estimator
$Y$ is simply $\mathcal{E}\|Y - \mu\|^2 = \mathrm{tr}\,I = p$. The estimator proposed by James and Stein
is (essentially)

(2)    $m(y) = \Bigl(1 - \frac{p - 2}{\|y - \nu\|^2}\Bigr)(y - \nu) + \nu,$

where $\nu$ is an arbitrary fixed vector and $p \ge 3$. This estimator shrinks the
observed $y$ toward the specified $\nu$. The amount of shrinkage is negligible if $y$
is very different from $\nu$ and is considerable if $y$ is close to $\nu$. In this sense $\nu$
is a favored point.

Theorem 3.5.1. With respect to the loss function (1), the risk of the estimator
(2) is less than the risk of the estimator $Y$ for $p \ge 3$.

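We shall show that the risk of $Y$ minus the risk of (2) is positive by
applying the following lemma due to Stein (1974).

Lemma 3.5.1. If $f(x)$ is a function such that

(3)    $f(b) - f(a) = \int_a^b f'(x)\, dx$

for all $a$ and $b$ ($a < b$) and if

(4)    $\int_{-\infty}^{\infty} |f'(x)|\, \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x - \theta)^2}\, dx < \infty,$

then

(5)    $\int_{-\infty}^{\infty} f(x)(x - \theta)\, \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x - \theta)^2}\, dx = \int_{-\infty}^{\infty} f'(x)\, \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x - \theta)^2}\, dx.$

Proof of Lemma. We write the left-hand side of (5) as

(6)    $\int_\theta^\infty [f(x) - f(\theta)](x - \theta)\, \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x - \theta)^2}\, dx + \int_{-\infty}^\theta [f(x) - f(\theta)](x - \theta)\, \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x - \theta)^2}\, dx$

        $= \int_\theta^\infty \int_\theta^x f'(y)(x - \theta)\, \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x - \theta)^2}\, dy\, dx - \int_{-\infty}^\theta \int_x^\theta f'(y)(x - \theta)\, \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x - \theta)^2}\, dy\, dx$

        $= \int_\theta^\infty \int_y^\infty f'(y)(x - \theta)\, \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x - \theta)^2}\, dx\, dy - \int_{-\infty}^\theta \int_{-\infty}^y f'(y)(x - \theta)\, \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x - \theta)^2}\, dx\, dy,$

which yields the right-hand side of (5). Fubini's theorem justifies the
interchange of order of integration. (See Problem 3.22.) ■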

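The lemma can also be derived by integration by parts in special cases.

Proof of Theorem 3.5.1. The difference in risks is

(7)    $\Delta R(\mu) = \mathcal{E}_\mu\bigl\{\|Y - \mu\|^2 - \|m(Y) - \mu\|^2\bigr\}$

        $= \mathcal{E}_\mu\Bigl\{\|Y - \mu\|^2 - \Bigl\|\Bigl(1 - \frac{p - 2}{\|Y - \nu\|^2}\Bigr)(Y - \nu) + \nu - \mu\Bigr\|^2\Bigr\}$

        $= \mathcal{E}_\mu\Bigl\{2(p - 2)\sum_{i=1}^p \frac{(Y_i - \nu_i)(Y_i - \mu_i)}{\|Y - \nu\|^2} - \frac{(p - 2)^2}{\|Y - \nu\|^2}\Bigr\}.$

Now we use Lemma 3.5.1 with

(8)    $f(y_i) = \frac{y_i - \nu_i}{\sum_{j=1}^p (y_j - \nu_j)^2}, \qquad f'(y_i) = \frac{1}{\sum_{j=1}^p (y_j - \nu_j)^2} - \frac{2(y_i - \nu_i)^2}{\bigl[\sum_{j=1}^p (y_j - \nu_j)^2\bigr]^2}.$

[For $p \ge 3$ the condition (4) is satisfied.] Then (7) is

(9)    $\Delta R(\mu) = \mathcal{E}_\mu\Bigl\{2(p - 2)\sum_{i=1}^p \Bigl[\frac{1}{\|Y - \nu\|^2} - \frac{2(Y_i - \nu_i)^2}{\|Y - \nu\|^4}\Bigr] - \frac{(p - 2)^2}{\|Y - \nu\|^2}\Bigr\}$

        $= (p - 2)^2\, \mathcal{E}_\mu \frac{1}{\|Y - \nu\|^2} > 0.$ ■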

This theorem states that $Y$ is inadmissible for estimating $\mu$ when $p \ge 3$,
since the estimator (2) has a smaller risk for every $\mu$ (regardless of the choice
of $\nu$).
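The inequality in Theorem 3.5.1 can be illustrated by simulation. The following is a rough Monte Carlo sketch, not a proof: it draws $Y$ from $N(\mu, I)$ repeatedly and averages the loss (1) for $Y$ and for the estimator (2). The seed, the number of replications, and the function names are arbitrary choices made for illustration.

```python
import random

def james_stein(y, nu):
    """Estimator (2): shrink y toward the fixed vector nu (requires p >= 3)."""
    p = len(y)
    dist2 = sum((yi - vi) ** 2 for yi, vi in zip(y, nu))
    factor = 1.0 - (p - 2) / dist2
    return [factor * (yi - vi) + vi for yi, vi in zip(y, nu)]

def simulated_risks(mu, nu, reps=20000, seed=0):
    """Average squared-error loss of Y and of the James-Stein estimator."""
    rng = random.Random(seed)
    loss_y = 0.0
    loss_js = 0.0
    for _ in range(reps):
        y = [rng.gauss(m, 1.0) for m in mu]          # Y ~ N(mu, I)
        m_hat = james_stein(y, nu)
        loss_y += sum((a - b) ** 2 for a, b in zip(y, mu))
        loss_js += sum((a - b) ** 2 for a, b in zip(m_hat, mu))
    return loss_y / reps, loss_js / reps
```

With `mu = [0.5] * 10` and `nu = [0.0] * 10`, the first average should be close to $p = 10$ and the second noticeably smaller. The size of the gap depends on $\|\mu - \nu\|^2$.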

The risk is the sum of the mean squared errors $\mathcal{E}[m_i(Y) - \mu_i]^2$. Since
$Y_1, \ldots, Y_p$ are independent and only the distribution of $Y_i$ depends on $\mu_i$, it is
puzzling that the improved estimator uses all the $Y_j$'s to estimate $\mu_i$; it seems
that irrelevant information is being used. Stein explained the phenomenon by
arguing that the sample distance squared of $Y$ from $\nu$, that is, $\|Y - \nu\|^2$,
overestimates the squared distance of $\mu$ from $\nu$ and hence that the estimator
$Y$ could be improved by bringing it nearer $\nu$ (whatever $\nu$ is). Berger (1980a),
following Brown, illustrated by Figure 3.4. The four points $x_1, x_2, x_3, x_4$
represent a spherical distribution centered at $\mu$. Consider the effects of
shrinkage. The average distance of $m(x_1)$ and $m(x_3)$ from $\mu$ is a little greater
than that of $x_1$ and $x_3$, but $m(x_2)$ and $m(x_4)$ are a little closer to $\mu$ than $x_2$
and $x_4$ are if the shrinkage is a certain amount. If $p = 3$, there are two more
points (not on the line $\nu$, $\mu$) that are shrunk closer to $\mu$.

[Figure 3.4. Effect of shrinkage.]

The risk of the estimator (2) is

(10)    $\mathcal{E}_\mu\|m(Y) - \mu\|^2 = p - (p - 2)^2\, \mathcal{E}_\mu \frac{1}{\|Y - \nu\|^2},$

where $\|Y - \nu\|^2$ has a noncentral $\chi^2$-distribution with $p$ degrees of freedom
and noncentrality parameter $\|\mu - \nu\|^2$. The farther $\mu$ is from $\nu$, the less the
improvement due to the James-Stein estimator, but there is always some
improvement. The density of $\|Y - \nu\|^2 = V$, say, is (28) of Section 3.3.3, where
$\tau^2 = \|\mu - \nu\|^2$. Then

(11)    $\mathcal{E}\frac{1}{\|Y - \nu\|^2} = \mathcal{E}\frac{1}{V} = e^{-\frac{1}{2}\tau^2} \sum_{\beta=0}^{\infty} \frac{(\tau^2/2)^\beta}{\beta!} \cdot \frac{1}{2^{\frac{1}{2}p + \beta}\,\Gamma(\frac{1}{2}p + \beta)} \int_0^\infty v^{\frac{1}{2}p + \beta - 2} e^{-\frac{1}{2}v}\, dv$

        $= e^{-\frac{1}{2}\tau^2} \sum_{\beta=0}^{\infty} \frac{(\tau^2/2)^\beta}{\beta!} \cdot \frac{1}{p - 2 + 2\beta}$

for $p \ge 3$. Note that for $\mu = \nu$, that is, $\tau^2 = 0$, (11) is $1/(p - 2)$ and the mean
squared error (10) is 2. For large $p$ the reduction in risk is considerable.

Table 3.2 gives values of the risk for $p = 10$ and $\sigma^2 = 1$. The argument
tabulated is the average squared distance $\|\mu - \nu\|^2/p = \tau^2/p$; evaluating (10)
and (11) shows, for instance, that the entry at 0.5 corresponds to $\tau^2 = 5$. For
example, if $\tau^2 = \|\mu - \nu\|^2$ is 5, the mean squared error of the James-Stein
estimator is 4.78, compared to 10 for the natural estimator; this is the case if
$\mu_i - \nu_i = 1/\sqrt{2} = 0.707$, $i = 1, \ldots, 10$, for instance.

Table 3.2†. Average Mean Squared Error of the
James-Stein Estimator for $p = 10$ and $\sigma^2 = 1$

  $\tau^2/p = \|\mu - \nu\|^2/p$    Mean squared error
  0.0                               2.00
  0.5                               4.78
  1.0                               6.21
  2.0                               7.51
  3.0                               8.24
  4.0                               8.62
  5.0                               8.86
  6.0                               9.03

† From Efron and Morris (1977).

An obvious question in using an estimator of this class is how to choose
the vector $\nu$ toward which the observed mean vector is shrunk; any $\nu$ yields
an estimator better than the natural one. However, as seen from Table 3.2,
the improvement is small if $\|\mu - \nu\|$ is very large. Thus, to be effective some
knowledge of the position of $\mu$ is necessary. A disadvantage of the procedure
is that it is not objective; the choice of $\nu$ is up to the investigator.
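The risk (10) can be evaluated directly from the series (11), whose terms are Poisson weights with mean $\tau^2/2$ multiplied by $1/(p - 2 + 2\beta)$. The sketch below truncates the series after a fixed number of terms. The truncation point and the function names are illustrative assumptions, and the routine is meant for moderate $\tau^2$ only.

```python
import math

def expected_inverse(p, tau2, terms=400):
    """E(1/V) from (11): V is noncentral chi-square with p degrees of
    freedom and noncentrality parameter tau2 (requires p >= 3)."""
    lam = tau2 / 2.0
    weight = math.exp(-lam)            # Poisson probability of beta = 0
    total = 0.0
    for beta in range(terms):
        total += weight / (p - 2 + 2 * beta)
        weight *= lam / (beta + 1)     # next Poisson probability
    return total

def james_stein_risk(p, tau2):
    """Risk (10) of the James-Stein estimator (2)."""
    return p - (p - 2) ** 2 * expected_inverse(p, tau2)
```

For $p = 10$ this gives exactly 2 at $\tau^2 = 0$, about 4.78 at $\tau^2 = 5$, and about 8.86 at $\tau^2 = 50$. The risk increases toward $p$ as $\tau^2$ grows, in line with the remark that the improvement shrinks when $\mu$ is far from $\nu$.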

A feature of the estimator we have been studying that seems disadvantageous
is that for small values of $\|Y - \nu\|$, the multiplier of $Y - \nu$ is negative;
that is, the estimator $m(Y)$ is in the direction from $\nu$ opposite to that of $Y$.
This disadvantage can be overcome and the estimator improved by replacing
the factor by 0 when the factor is negative.

Definition 3.5.1. For any function $g(u)$, let

(12)    $g^+(u) = g(u), \quad g(u) \ge 0,$
        $\phantom{g^+(u)} = 0, \qquad\; g(u) < 0.$

Lemma 3.5.2. When $X$ is distributed according to $N(\mu, I)$,

(13)    $\mathcal{E}\|g^+(\|X\|)X - \mu\|^2 \le \mathcal{E}\|g(\|X\|)X - \mu\|^2.$

Proof. The right-hand side of (13) minus the left-hand side is

(14)    $\mathcal{E}\bigl\{g^2(\|X\|)\|X\|^2 - [g^+(\|X\|)]^2\|X\|^2\bigr\} \ge 0$

plus 2 times

(15)    $\mathcal{E}\,\mu'X[g^+(\|X\|) - g(\|X\|)]$

        $= \|\mu\| \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} y_1[g^+(\|y\|) - g(\|y\|)]\, \frac{1}{(2\pi)^{\frac{1}{2}p}} \exp\Bigl\{-\tfrac{1}{2}\Bigl[\sum_{i=1}^p y_i^2 - 2y_1\|\mu\| + \|\mu\|^2\Bigr]\Bigr\}\, dy,$

where $y' = x'P$, $(\|\mu\|, 0, \ldots, 0) = \mu'P$, and $PP' = I$. [The first column of $P$ is
$(1/\|\mu\|)\mu$.] Then (15) is $\|\mu\|$ times

(16)    $e^{-\frac{1}{2}\|\mu\|^2} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \int_0^{\infty} y_1[g^+(\|y\|) - g(\|y\|)]\bigl[e^{\|\mu\|y_1} - e^{-\|\mu\|y_1}\bigr]$
        $\cdot\, \frac{1}{(2\pi)^{\frac{1}{2}p}} e^{-\frac{1}{2}\sum_{i=1}^p y_i^2}\, dy_1\, dy_2 \cdots dy_p \ge 0$

(by replacing $y_1$ by $-y_1$ for $y_1 < 0$). ■

Theorem 3.5.2. The estimator

(17)    $m^+(y) = \Bigl(1 - \frac{p - 2}{\|y - \nu\|^2}\Bigr)^{+}(y - \nu) + \nu$

has smaller risk than $m(y)$ defined by (2) and is minimax.

Proof. In Lemma 3.5.2, let $g(u) = 1 - (p - 2)/u^2$ and $X = Y - \nu$, and
replace $\mu$ by $\mu - \nu$. The second assertion in the theorem follows from
Theorem 3.4.6. ■
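The positive-part rule (17) differs from (2) only in truncating the multiplier at 0. The short sketch below shows this. The function name is illustrative, and the branch for $y = \nu$ is an added convention to avoid dividing by zero (an event of probability 0 under the normal model).

```python
def james_stein_plus(y, nu):
    """Positive-part estimator (17): the shrinkage factor is never negative."""
    p = len(y)
    dist2 = sum((yi - vi) ** 2 for yi, vi in zip(y, nu))
    if dist2 == 0.0:
        return list(nu)                # y coincides with nu; nothing to shrink
    factor = max(0.0, 1.0 - (p - 2) / dist2)
    return [factor * (yi - vi) + vi for yi, vi in zip(y, nu)]
```

When $\|y - \nu\|^2 \le p - 2$ the estimate is $\nu$ itself, whereas (2) would place the estimate on the opposite side of $\nu$ from $y$. For larger $\|y - \nu\|^2$ the two estimators coincide.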


The theorem shows that $m(Y)$ is not admissible. However, it is known that
$m^+(Y)$ is also not admissible, but it is believed that not much further
improvement is possible.

This approach is easily extended to the case where one observes $x_1, \ldots, x_N$
from $N(\mu, \Sigma)$ with loss function $L(\mu, m) = (m - \mu)'\Sigma^{-1}(m - \mu)$. Let $\Sigma =
CC'$ for some nonsingular $C$, $x_\alpha = Cx_\alpha^*$, $\alpha = 1, \ldots, N$, $\mu = C\mu^*$, and
$L^*(m^*, \mu^*) = \|m^* - \mu^*\|^2$. Then $x_1^*, \ldots, x_N^*$ are observations from $N(\mu^*, I)$,
and the problem is reduced to the earlier one. Then

(18)    $\Bigl(1 - \frac{p - 2}{N(\bar{x} - \nu)'\Sigma^{-1}(\bar{x} - \nu)}\Bigr)^{+}(\bar{x} - \nu) + \nu$

is a minimax estimator of $\mu$.

3.5.3. Estimation for a General Known Covariance Matrix and an
Arbitrary Quadratic Loss Function

Let the parent distribution be $N(\mu, \Sigma)$, where $\Sigma$ is known, and let the loss
function be

(19)    $L(\mu, m) = (m - \mu)'Q(m - \mu),$

where $Q$ is an arbitrary positive definite matrix which reflects the relative
importance of errors in different directions. (If the loss function were
singular, the dimensionality of $x$ could be reduced so as to make the loss
matrix nonsingular.) Then the sample mean $\bar{x}$ has the distribution
$N(\mu, (1/N)\Sigma)$ and risk (expected loss)

(20)    $\mathcal{E}(\bar{x} - \mu)'Q(\bar{x} - \mu) = \mathcal{E}\,\mathrm{tr}\,Q(\bar{x} - \mu)(\bar{x} - \mu)' = \frac{1}{N}\,\mathrm{tr}\,Q\Sigma,$

which is constant, not depending on $\mu$.

Several estimators that improve on $\bar{x}$ have been proposed. First we take
up an estimator proposed independently by Berger (1975) and Hudson
(1974).

Theorem 3.5.3. Let $r(z)$, $0 \le z < \infty$, be a nondecreasing differentiable function
such that $0 \le r(z) \le 2(p - 2)$. Then for $p \ge 3$

(21)    $m(\bar{x}) = \Bigl(I - \frac{r\bigl(N^2(\bar{x} - \nu)'\Sigma^{-1}Q^{-1}\Sigma^{-1}(\bar{x} - \nu)\bigr)}{N(\bar{x} - \nu)'\Sigma^{-1}Q^{-1}\Sigma^{-1}(\bar{x} - \nu)}\, Q^{-1}\Sigma^{-1}\Bigr)(\bar{x} - \nu) + \nu$

has smaller risk than $\bar{x}$ and is minimax.


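Proof. There exists a matrix $C$ such that $C'QC = I$ and $(1/N)\Sigma = C\Delta C'$
where $\Delta$ is diagonal with diagonal elements $\delta_1 \ge \delta_2 \ge \cdots \ge \delta_p > 0$ (Theorem
A.2.2 of the Appendix). Let $\bar{x} = Cy + \nu$ and $\mu = C\mu^* + \nu$. Then $y$ has the
distribution $N(\mu^*, \Delta)$, and the transformed loss function is

(22)    $L^*(m^*, \mu^*) = (m^* - \mu^*)'(m^* - \mu^*) = \|m^* - \mu^*\|^2.$

The estimator (21) of $\mu$ is transformed to the estimator of $\mu^* = C^{-1}(\mu - \nu)$,

(23)    $m^*(y) = \Bigl(I - \frac{r(y'\Delta^{-2}y)}{y'\Delta^{-2}y}\,\Delta^{-1}\Bigr)y.$

We now proceed as in the proof of Theorem 3.5.1. The difference in risks
between $y$ and $m^*$ is

(24)    $\Delta R(\mu^*) = \mathcal{E}_{\mu^*}\bigl\{\|Y - \mu^*\|^2 - \|m^*(Y) - \mu^*\|^2\bigr\}$

        $= \mathcal{E}_{\mu^*}\Bigl\{2\,\frac{r(Y'\Delta^{-2}Y)}{Y'\Delta^{-2}Y}\sum_{i=1}^p \frac{Y_i}{\delta_i}(Y_i - \mu_i^*) - \frac{r^2(Y'\Delta^{-2}Y)}{Y'\Delta^{-2}Y}\Bigr\}.$

Since $r(z)$ is differentiable, we use Lemma 3.5.1 with $(x - \theta) = (y_i - \mu_i^*)/\sqrt{\delta_i}$
and

(25)    $f(y_i) = \frac{r(y'\Delta^{-2}y)}{y'\Delta^{-2}y}\, y_i,$

(26)    $f'(y_i) = \frac{r(y'\Delta^{-2}y)}{y'\Delta^{-2}y} + \frac{2r'(y'\Delta^{-2}y)}{y'\Delta^{-2}y}\cdot\frac{y_i^2}{\delta_i^2} - \frac{2r(y'\Delta^{-2}y)}{(y'\Delta^{-2}y)^2}\cdot\frac{y_i^2}{\delta_i^2}.$

Then

(27)    $\Delta R(\mu^*) = \mathcal{E}_{\mu^*}\Bigl\{2(p - 2)\,\frac{r(Y'\Delta^{-2}Y)}{Y'\Delta^{-2}Y} + 4r'(Y'\Delta^{-2}Y) - \frac{r^2(Y'\Delta^{-2}Y)}{Y'\Delta^{-2}Y}\Bigr\} \ge 0,$

since $r(y'\Delta^{-2}y) \le 2(p - 2)$ and $r'(y'\Delta^{-2}y) \ge 0$. ■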

Corollary 3.5.1. For $p \ge 3$

(28)    $m(\bar{x}) = \Bigl(I - \frac{\min\bigl[p - 2,\; N^2(\bar{x} - \nu)'\Sigma^{-1}Q^{-1}\Sigma^{-1}(\bar{x} - \nu)\bigr]}{N(\bar{x} - \nu)'\Sigma^{-1}Q^{-1}\Sigma^{-1}(\bar{x} - \nu)}\, Q^{-1}\Sigma^{-1}\Bigr)(\bar{x} - \nu) + \nu$

has smaller risk than $\bar{x}$ and is minimax.

Proof. The function $r(z) = \min(p - 2, z)$ is differentiable except at $z =
p - 2$. The function $r(z)$ can be approximated arbitrarily closely by a
differentiable function. (For example, the corner at $z = p - 2$ can be smoothed by
a circular arc of arbitrarily small radius.) We shall not give the details of the
proof. ■

In canonical form $y$ is shrunk by a scalar times a diagonal matrix. The
larger the variance of a component is, the less the effect of the shrinkage.

Berger (1975) has proved these results for a more general density, that is,
for a mixture of normals. Berger (1976) has also proved in the case of
normality that if

(29)    $r(z) = \frac{z\int_0^{\alpha} u^{\frac{1}{2}p - c + 1} e^{-\frac{1}{2}uz}\, du}{\int_0^{\alpha} u^{\frac{1}{2}p - c} e^{-\frac{1}{2}uz}\, du}$

for $3 - \frac{1}{2}p \le c < 1 + \frac{1}{2}p$, where $\alpha$ is the smallest characteristic root of $\Sigma Q$,
then the estimator $m$ given by (21) is minimax, is admissible if $c < 2$, and is
proper Bayes if $c < 1$.

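Another approach to minimax estimators has been introduced by Bhattacharya
(1966). Let $C$ be such that $C^{-1}(1/N)\Sigma(C^{-1})' = I$ and $C'QC = Q^*$,
which is diagonal with diagonal elements $q_1^* \ge q_2^* \ge \cdots \ge q_p^* > 0$. Then $y =
C^{-1}\bar{x}$ has the distribution $N(\mu^*, I)$, and the loss function is

(30)    $L^*(m^*, \mu^*) = \sum_{i=1}^p q_i^*(m_i^* - \mu_i^*)^2$

        $= \sum_{i=1}^p \sum_{j=i}^p \alpha_j (m_i^* - \mu_i^*)^2$

        $= \sum_{j=1}^p \alpha_j \sum_{i=1}^j (m_i^* - \mu_i^*)^2$

        $= \sum_{j=1}^p \alpha_j \|m^{*(j)} - \mu^{*(j)}\|^2,$

where $\alpha_j = q_j^* - q_{j+1}^*$, $j = 1, \ldots, p - 1$, $\alpha_p = q_p^*$, $m^{*(j)} = (m_1^*, \ldots, m_j^*)'$, and
$\mu^{*(j)} = (\mu_1^*, \ldots, \mu_j^*)'$, $j = 1, \ldots, p$. This decomposition of the loss function
suggests combining minimax estimators of the vectors $\mu^{*(j)}$, $j = 1, \ldots, p$. Let
$y^{(j)} = (y_1, \ldots, y_j)'$.

Theorem 3.5.4. If $h^{(j)}(y^{(j)}) = [h_1^{(j)}(y^{(j)}), \ldots, h_j^{(j)}(y^{(j)})]'$ is a minimax
estimator of $\mu^{*(j)}$ under the loss function $\|m^{*(j)} - \mu^{*(j)}\|^2$, $j = 1, \ldots, p$, then

(31)    $\frac{1}{q_i^*}\sum_{j=i}^p \alpha_j h_i^{(j)}(y^{(j)}), \qquad i = 1, \ldots, p,$

is a minimax estimator of $\mu_1^*, \ldots, \mu_p^*$.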

Proof. First consider the randomized estimator defined by 

( 32 ) Pr(G,(^) =h ( i J) (y U) )) = 卜 “…， P, 

for the ith component. Then the risk of this estimator is 

(33) E 9f^[G,.(r)~ ^] 2 = Zqrz ^<^A h ' J) ( YU) )~^} 2 

i = l t = l j=i 1 

/ =1 卜 1 

=Ea y ^||A 0 ) (r 0 ) )~^ 0 ) ll 2 
/ =1 

^ E a jJ = E 

/-i y =l 

and hence the estimator defined by (32) is minimax. 
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Since the expected value of G t (Y) with respect to (32) is (31) and the loss 
function is convex, the risk of the estimator (31) is less than that of the 
randomized estimator (by Jensen’s inequality). ■ 
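The rearrangement of the weighted loss in (30) is purely algebraic, so it can be checked on any numbers. The following is a small sketch assuming NumPy; the function names `loss_direct` and `loss_decomposed` are hypothetical labels introduced here for the two sides of (30).

```python
import numpy as np

def loss_direct(q, m, mu):
    """Left side of (30): sum_i q_i (m_i - mu_i)^2."""
    q, m, mu = (np.asarray(a, dtype=float) for a in (q, m, mu))
    return float(np.sum(q * (m - mu) ** 2))

def loss_decomposed(q, m, mu):
    """Right side of (30): sum_j alpha_j ||m^(j) - mu^(j)||^2."""
    q, m, mu = (np.asarray(a, dtype=float) for a in (q, m, mu))
    p = q.size
    # alpha_j = q_j - q_{j+1} for j < p, and alpha_p = q_p.
    alpha = np.append(q[:-1] - q[1:], q[-1])
    sq = (m - mu) ** 2
    # The j-th term uses only the first j coordinates.
    return float(sum(alpha[j] * np.sum(sq[: j + 1]) for j in range(p)))
```

Because $\sum_{j \ge i} \alpha_j = q_i^*$, the two functions agree for every ordered set of weights; the ordering $q_1^* \ge \cdots \ge q_p^*$ is needed only to make each $\alpha_j$ nonnegative.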


3.6. ELLIPTICALLY CONTOURED DISTRIBUTIONS

3.6.1. Observations Elliptically Contoured

Let $x_1, \ldots, x_N$ be $N$ ($= n + 1$) independent observations on a random vector
$X$ with density $|\Lambda|^{-1/2} g[(x - \nu)'\Lambda^{-1}(x - \nu)]$. The density of the sample is

(1)    $|\Lambda|^{-N/2} \prod_{\alpha=1}^{N} g[(x_\alpha - \nu)'\Lambda^{-1}(x_\alpha - \nu)]$.

The sample mean $\bar{x}$ and covariance matrix $S = (1/n)[\sum_{\alpha=1}^{N} (x_\alpha - \mu)(x_\alpha - \mu)'
- N(\bar{x} - \mu)(\bar{x} - \mu)']$ are unbiased estimators of the mean $\mu = \nu$ and the
covariance matrix $\Sigma = [\mathscr{E}R^2/p]\Lambda$, where $R^2 = (X - \nu)'\Lambda^{-1}(X - \nu)$.


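Theorem 3.6.1. The covariances of the mean and covariance of a sample of
$N$ from $|\Lambda|^{-1/2} g[(x - \nu)'\Lambda^{-1}(x - \nu)]$ with $\mathscr{E}R^4 < \infty$ are

(2)    $\mathscr{E}(\bar{x} - \mu)(\bar{x} - \mu)' = \frac{1}{N}\Sigma$,

(3)    $\mathscr{E}(s_{ij} - \sigma_{ij})(\bar{x} - \mu) = 0, \qquad i, j = 1, \ldots, p,$

(4)    $\mathscr{E}(s_{ij} - \sigma_{ij})(s_{kl} - \sigma_{kl}) = \frac{1}{n}(\sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk})
        + \frac{\kappa}{N}(\sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}), \qquad i, j, k, l = 1, \ldots, p.$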


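Lemma 3.6.1. The second-order moments of the elements of $S$ are

(5)    $\mathscr{E} s_{ij}s_{kl} = \sigma_{ij}\sigma_{kl} + \frac{1}{n}(\sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}) + \frac{\kappa}{N}(\sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}),$
        $i, j, k, l = 1, \ldots, p.$

Proof of Lemma 3.6.1. We have

(6)    $\mathscr{E} \sum_{\alpha,\beta=1}^{N} (x_{i\alpha} - \mu_i)(x_{j\alpha} - \mu_j)(x_{k\beta} - \mu_k)(x_{l\beta} - \mu_l)$
        $= N\,\mathscr{E}(x_{i\alpha} - \mu_i)(x_{j\alpha} - \mu_j)(x_{k\alpha} - \mu_k)(x_{l\alpha} - \mu_l)$
        $\quad + N(N - 1)\,\mathscr{E}(x_{i\alpha} - \mu_i)(x_{j\alpha} - \mu_j)\,\mathscr{E}(x_{k\beta} - \mu_k)(x_{l\beta} - \mu_l)$
        $= N(1 + \kappa)(\sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}) + N(N - 1)\sigma_{ij}\sigma_{kl},$

(7)    $\mathscr{E} N^2(\bar{x}_i - \mu_i)(\bar{x}_j - \mu_j)(\bar{x}_k - \mu_k)(\bar{x}_l - \mu_l)$
        $= \frac{1}{N^2}\,\mathscr{E} \sum_{\alpha,\beta,\gamma,\delta=1}^{N} (x_{i\alpha} - \mu_i)(x_{j\beta} - \mu_j)(x_{k\gamma} - \mu_k)(x_{l\delta} - \mu_l)$
        $= \frac{1}{N}(1 + \kappa)(\sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk})
        + \frac{N - 1}{N}(\sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}),$

(8)    $\mathscr{E} \sum_{\alpha=1}^{N} (x_{i\alpha} - \mu_i)(x_{j\alpha} - \mu_j)\, \frac{1}{N} \sum_{\beta,\gamma=1}^{N} (x_{k\beta} - \mu_k)(x_{l\gamma} - \mu_l)$
        $= (1 + \kappa)(\sigma_{ij}\sigma_{kl} + \sigma_{ik}\sigma_{jl} + \sigma_{il}\sigma_{jk}) + (N - 1)\sigma_{ij}\sigma_{kl}.$ ■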

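It will be convenient to use more matrix algebra. Define $\mathrm{vec}\,B$, $B \otimes C$ (the
Kronecker product), and $K_{mn}$ (the commutator matrix) by

(9)    $\mathrm{vec}\,B = \mathrm{vec}(b_1, \ldots, b_n) = \begin{pmatrix} b_1 \\ \vdots \\ b_n \end{pmatrix}$,

(10)    $B \otimes C = \begin{pmatrix} b_{11}C & \cdots & b_{1n}C \\ \vdots & & \vdots \\ b_{m1}C & \cdots & b_{mn}C \end{pmatrix}$,

(11)    $K_{mn}\,\mathrm{vec}\,B = \mathrm{vec}\,B'$.

See, e.g., Magnus and Neudecker (1979) or Section A.5 of the Appendix. We
can rewrite (4) as

(12)    $\mathscr{C}(\mathrm{vec}\,S) = \mathscr{E}(\mathrm{vec}\,S - \mathrm{vec}\,\Sigma)(\mathrm{vec}\,S - \mathrm{vec}\,\Sigma)'$
        $= \frac{n\kappa + N}{nN}(I_{p^2} + K_{pp})(\Sigma \otimes \Sigma) + \frac{\kappa}{N}\,\mathrm{vec}\,\Sigma\,(\mathrm{vec}\,\Sigma)'.$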
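The agreement between the matrix form (12) and the elementwise form (4) can be confirmed numerically. The sketch below assumes NumPy and a column-stacking $\mathrm{vec}$; the helper names `commutation_matrix`, `cov_vec_S`, and `cov_s_elementwise` are hypothetical and are introduced only for this check.

```python
import numpy as np

def commutation_matrix(m, n):
    """K_mn with K_mn vec(B) = vec(B') for an m x n matrix B (column-stacking vec)."""
    K = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            # B[i, j] is entry j*m + i of vec(B) and entry i*n + j of vec(B').
            K[i * n + j, j * m + i] = 1.0
    return K

def cov_vec_S(Sigma, kappa, N):
    """Matrix form (12) of the covariance of vec S."""
    Sigma = np.asarray(Sigma, dtype=float)
    p = Sigma.shape[0]
    n = N - 1
    K = commutation_matrix(p, p)
    vec_Sigma = Sigma.reshape(-1, order="F")
    first = ((n * kappa + N) / (n * N)) * (np.eye(p * p) + K) @ np.kron(Sigma, Sigma)
    return first + (kappa / N) * np.outer(vec_Sigma, vec_Sigma)

def cov_s_elementwise(Sigma, kappa, N, i, j, k, l):
    """Elementwise form (4) of Cov(s_ij, s_kl)."""
    s = np.asarray(Sigma, dtype=float)
    n = N - 1
    return ((s[i, k] * s[j, l] + s[i, l] * s[j, k]) / n
            + (kappa / N) * (s[i, j] * s[k, l] + s[i, k] * s[j, l] + s[i, l] * s[j, k]))
```

In this indexing the element $s_{ij}$ occupies position $jp + i$ of $\mathrm{vec}\,S$ (zero-based), so entry $(jp + i,\; lp + k)$ of the matrix returned by `cov_vec_S` should equal `cov_s_elementwise(Sigma, kappa, N, i, j, k, l)`.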


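Theorem 3.6.2

(13)    $\sqrt{N} \begin{pmatrix} \bar{x} - \mu \\ \mathrm{vec}\,S - \mathrm{vec}\,\Sigma \end{pmatrix} \xrightarrow{d} N\left[ \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \Sigma & 0 \\ 0 & (\kappa + 1)(I_{p^2} + K_{pp})(\Sigma \otimes \Sigma) + \kappa\,\mathrm{vec}\,\Sigma\,(\mathrm{vec}\,\Sigma)' \end{pmatrix} \right]$.

This theorem follows from the central limit theorem for independent
identically distributed random vectors (with finite fourth moments). The
theorem forms the basis for large-sample inference.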


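3.6.2. Estimation of the Kurtosis Parameter

To apply the large-sample distribution theory derived for normal distributions
to problems of inference for elliptically contoured distributions it is
necessary to know or estimate the kurtosis parameter $\kappa$. Note that

(14)    $\mathscr{E}\{(X - \mu)'\Sigma^{-1}(X - \mu)\}^2 = \frac{p^2\,\mathscr{E}R^4}{(\mathscr{E}R^2)^2} = p(p + 2)(1 + \kappa)$.

Since $\bar{x} \xrightarrow{p} \mu$ and $S \xrightarrow{p} \Sigma$,

(15)    $\frac{1}{N} \sum_{\alpha=1}^{N} [(x_\alpha - \bar{x})'S^{-1}(x_\alpha - \bar{x})]^2 \xrightarrow{p} p(p + 2)(1 + \kappa)$.

A consistent estimator of $\kappa$ is

(16)    $\hat{\kappa} = \frac{1}{p(p + 2)}\,\frac{1}{N} \sum_{\alpha=1}^{N} [(x_\alpha - \bar{x})'S^{-1}(x_\alpha - \bar{x})]^2 - 1$.

Mardia (1970) proposed using M to form a consistent estimator of $\kappa$.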
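The estimator (16) is straightforward to compute from a data matrix. The following is a minimal sketch assuming NumPy, with rows of `X` as observations and $S$ computed with divisor $n = N - 1$ as in Section 3.6.1; the function name `kurtosis_estimate` is hypothetical.

```python
import numpy as np

def kurtosis_estimate(X):
    """Sketch of (16): consistent estimator of the kurtosis parameter kappa."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]            # treat a vector as N observations with p = 1
    N, p = X.shape
    D = X - X.mean(axis=0)        # deviations x_alpha - xbar
    S = D.T @ D / (N - 1)         # sample covariance matrix, divisor n = N - 1
    # Quadratic forms (x_alpha - xbar)' S^{-1} (x_alpha - xbar), one per row.
    d2 = np.einsum("ai,ij,aj->a", D, np.linalg.inv(S), D)
    return float(np.mean(d2 ** 2) / (p * (p + 2)) - 1.0)
```

For a large sample from a normal distribution the returned value should be near 0, since $\kappa = 0$ in that case.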


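3.6.3. Maximum Likelihood Estimation

We have considered using $S$ as an estimator of $\Sigma = (\mathscr{E}R^2/p)\Lambda$. When the
parent distribution is normal, $S$ is the sufficient statistic invariant with
respect to translations and hence is the efficient unbiased estimator. Now we
study other estimators.

We consider first the maximum likelihood estimators of $\mu$ and $\Lambda$ when
the form of the density $g(\cdot)$ is known. The logarithm of the likelihood
function is

(17)    $\log L = -\frac{N}{2}\log|\Lambda| + \sum_{\alpha=1}^{N} \log g[(x_\alpha - \mu)'\Lambda^{-1}(x_\alpha - \mu)]$.

The derivatives of $\log L$ with respect to the components of $\mu$ are

(18)    $\frac{\partial \log L}{\partial \mu} = -2 \sum_{\alpha=1}^{N} \frac{g'[(x_\alpha - \mu)'\Lambda^{-1}(x_\alpha - \mu)]}{g[(x_\alpha - \mu)'\Lambda^{-1}(x_\alpha - \mu)]}\,\Lambda^{-1}(x_\alpha - \mu)$.

Setting the vector of derivatives equal to 0 leads to the equation

(19)    $\sum_{\alpha=1}^{N} \frac{g'[(x_\alpha - \hat{\mu})'\hat{\Lambda}^{-1}(x_\alpha - \hat{\mu})]}{g[(x_\alpha - \hat{\mu})'\hat{\Lambda}^{-1}(x_\alpha - \hat{\mu})]}\,(x_\alpha - \hat{\mu}) = 0$.

Setting equal to 0 the derivatives of $\log L$ with respect to the elements of
$\Lambda^{-1}$ gives

(20)    $\hat{\Lambda} = -\frac{2}{N} \sum_{\alpha=1}^{N} \frac{g'[(x_\alpha - \hat{\mu})'\hat{\Lambda}^{-1}(x_\alpha - \hat{\mu})]}{g[(x_\alpha - \hat{\mu})'\hat{\Lambda}^{-1}(x_\alpha - \hat{\mu})]}\,(x_\alpha - \hat{\mu})(x_\alpha - \hat{\mu})'$.

The estimator $\hat{\Lambda}$ is a kind of weighted average of the rank 1 matrices
$(x_\alpha - \hat{\mu})(x_\alpha - \hat{\mu})'$. In the normal case the weights are $1/N$. In most cases
(19) and (20) cannot be solved explicitly, but the solution may be approximated
by iterative methods.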
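One simple iterative scheme, offered here only as a sketch and not as the method of the text, is to alternate between (19) and (20) with the weights held at their current values. The code assumes NumPy and at least two components; the function name `elliptical_mle` and the argument `weight`, which stands for $-2g'(d)/g(d)$, are hypothetical. Convergence is not guaranteed for an arbitrary $g$.

```python
import numpy as np

def elliptical_mle(X, weight, n_iter=200, tol=1e-10):
    """Fixed-point sketch for (19)-(20); weight(d) stands for -2 g'(d) / g(d)."""
    X = np.asarray(X, dtype=float)
    N, p = X.shape
    mu = X.mean(axis=0)
    Lam = np.atleast_2d(np.cov(X, rowvar=False, bias=True))
    for _ in range(n_iter):
        D = X - mu
        # d_alpha = (x_alpha - mu)' Lam^{-1} (x_alpha - mu) at the current values.
        d = np.einsum("ai,ij,aj->a", D, np.linalg.inv(Lam), D)
        u = weight(d)
        mu_new = (u[:, None] * X).sum(axis=0) / u.sum()      # solves (19) for fixed weights
        D_new = X - mu_new
        Lam_new = (u[:, None] * D_new).T @ D_new / N          # (20) for fixed weights
        converged = (np.abs(mu_new - mu).max() < tol
                     and np.abs(Lam_new - Lam).max() < tol)
        mu, Lam = mu_new, Lam_new
        if converged:
            break
    return mu, Lam
```

In the normal case $g(w) \propto e^{-w/2}$, so `weight` returns 1 for every observation and the iteration stops at $\bar{x}$ and $(1/N)\sum_\alpha (x_\alpha - \bar{x})(x_\alpha - \bar{x})'$. For a multivariate $t$ density with $\nu$ degrees of freedom (an example chosen here, not taken from the text) the weight is $(\nu + p)/(\nu + d)$.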

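The covariance matrix of the limiting normal distribution of $\sqrt{N}(\mathrm{vec}\,\hat{\Lambda} -
\mathrm{vec}\,\Lambda)$ is

(21)    $\mathscr{C}(\mathrm{vec}\,\hat{\Lambda}) = \sigma_{1g}(I_{p^2} + K_{pp})(\Lambda \otimes \Lambda) + \sigma_{2g}\,\mathrm{vec}\,\Lambda\,(\mathrm{vec}\,\Lambda)'$,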
where 



(23)    $\sigma_{2g} = \frac{-2\sigma_{1g}(1 - \sigma_{1g})}{2 + p(1 - \sigma_{1g})}$.

See Tyler (1982). 


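3.6.4. Elliptically Contoured Matrix Distributions

Let

(24)    $Y = (y_1, \ldots, y_N)'$

be an $N \times p$ random matrix with density $g(Y'Y) = g(\sum_{\alpha=1}^{N} y_\alpha y_\alpha')$. Note that
the density $g(Y'Y)$ is invariant with respect to orthogonal transformations
$Y^* = O_N Y$. Such densities are known as left spherical matrix densities. An
example is the density of $N$ observations from $N(0, I_p)$,

(25)    $(2\pi)^{-\frac{1}{2}Np}\, e^{-\frac{1}{2}\mathrm{tr}\,Y'Y}$.

In this example $Y$ is also right spherical: $YO_p \stackrel{d}{=} Y$. When $Y$ is both left
spherical and right spherical, it is known as spherical. Further, if $Y$ has the
density (25), $\mathrm{vec}\,Y$ is spherical; in general if $Y$ has a density, the density is of
the form

(26)    $g(\mathrm{tr}\,Y'Y) = g\left(\sum_{\alpha=1}^{N} \sum_{i=1}^{p} y_{i\alpha}^2\right) = g(\mathrm{tr}\,YY')$
        $= g[(\mathrm{vec}\,Y)'\,\mathrm{vec}\,Y] = g[(\mathrm{vec}\,Y')'\,\mathrm{vec}\,Y']$.

We call this model vector-spherical. Define

(27)    $X = YC' + \varepsilon_N \mu'$,

where $C'\Lambda^{-1}C = I_p$ and $\varepsilon_N' = (1, \ldots, 1)$. Since (27) is equivalent to $Y =
(X - \varepsilon_N \mu')(C')^{-1}$ and $(C')^{-1}C^{-1} = \Lambda^{-1}$, the matrix $X$ has the density

(28)    $|\Lambda|^{-N/2}\, g[\mathrm{tr}\,(X - \varepsilon_N \mu')\Lambda^{-1}(X - \varepsilon_N \mu')']$
        $= |\Lambda|^{-N/2}\, g\left[\sum_{\alpha=1}^{N} (x_\alpha - \mu)'\Lambda^{-1}(x_\alpha - \mu)\right]$.

From (26) we deduce that $\mathrm{vec}\,Y$ has the representation

(29)    $\mathrm{vec}\,Y \stackrel{d}{=} R\,\mathrm{vec}\,U$,

where $w = R^2$ has the density

(30)    $\frac{\pi^{\frac{1}{2}Np}}{\Gamma(Np/2)}\, w^{\frac{1}{2}Np - 1}\, g(w)$,

$\mathrm{vec}\,U$ has the uniform distribution on $\sum_{\alpha=1}^{N} \sum_{i=1}^{p} u_{i\alpha}^2 = 1$, and $R$ and $\mathrm{vec}\,U$ are
independent. The covariance matrix of $\mathrm{vec}\,Y$ is

(31)    $\mathscr{C}(\mathrm{vec}\,Y) = \mathscr{E}\,\mathrm{vec}\,Y\,(\mathrm{vec}\,Y)' = \frac{\mathscr{E}R^2}{Np}\, I_p \otimes I_N$.

Since $\mathrm{vec}\,FGH = (H' \otimes F)\,\mathrm{vec}\,G$ for any conformable matrices $F$, $G$, and $H$,
we can write (27) as

(32)    $\mathrm{vec}\,X = (C \otimes I_N)\,\mathrm{vec}\,Y + \mu \otimes \varepsilon_N$.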
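The identity $\mathrm{vec}\,FGH = (H' \otimes F)\,\mathrm{vec}\,G$ and its use in (32) can be verified on arbitrary matrices. This is a sketch assuming NumPy and a column-stacking $\mathrm{vec}$; the helper names `vec` and `vec_X_from_Y` are hypothetical.

```python
import numpy as np

def vec(M):
    """Stack the columns of M into a single vector."""
    return np.asarray(M, dtype=float).reshape(-1, order="F")

def vec_X_from_Y(Y, C, mu):
    """Right side of (32): (C kron I_N) vec Y + mu kron eps_N."""
    Y = np.asarray(Y, dtype=float)
    N = Y.shape[0]
    eps_N = np.ones(N)
    return np.kron(C, np.eye(N)) @ vec(Y) + np.kron(mu, eps_N)

# Arbitrary conformable matrices for the general identity.
F = np.arange(6.0).reshape(2, 3)
G = np.arange(12.0).reshape(3, 4) - 5.0
H = np.arange(8.0).reshape(4, 2) + 1.0
lhs = vec(F @ G @ H)
rhs = np.kron(H.T, F) @ vec(G)
```

Taking $F = I_N$, $G = Y$, and $H = C'$ in the identity gives the first term of (32); the second term is $\mathrm{vec}(\varepsilon_N \mu') = \mu \otimes \varepsilon_N$.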

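Thus

(33)    $\mathscr{E}\,\mathrm{vec}\,X = \mu \otimes \varepsilon_N$,

(34)    $\mathscr{C}(\mathrm{vec}\,X) = (C \otimes I_N)\,\mathscr{C}(\mathrm{vec}\,Y)\,(C' \otimes I_N) = \frac{\mathscr{E}R^2}{Np}\,\Lambda \otimes I_N$,

(35)    $\mathscr{E}(\text{row of } X) = \mu'$,

(36)    $\mathscr{C}(\text{row of } X) = \frac{\mathscr{E}R^2}{Np}\,\Lambda$.

The rows of $X$ are uncorrelated (though not necessarily independent). From
(32) we obtain

(37)    $\mathrm{vec}\,X \stackrel{d}{=} R\,(C \otimes I_N)\,\mathrm{vec}\,U + \mu \otimes \varepsilon_N$,

(38)    $X \stackrel{d}{=} R\,UC' + \varepsilon_N \mu'$.

Since $X - \varepsilon_N \mu' = (X - \varepsilon_N \bar{x}') + \varepsilon_N (\bar{x} - \mu)'$ and $\varepsilon_N'(X - \varepsilon_N \bar{x}') = 0$, we can
write the density of $X$ as

(39)    $|\Lambda|^{-N/2}\, g[\mathrm{tr}\,\Lambda^{-1}(X - \varepsilon_N \bar{x}')'(X - \varepsilon_N \bar{x}') + N(\bar{x} - \mu)'\Lambda^{-1}(\bar{x} - \mu)]$,

where $\bar{x} = (1/N)X'\varepsilon_N$. This shows that a sufficient set of statistics for $\mu$ and
$\Lambda$ is $\bar{x}$ and $nS = (X - \varepsilon_N \bar{x}')'(X - \varepsilon_N \bar{x}')$, as for the normal distribution. The
maximum likelihood estimators can be derived from the following theorem,
which will be used later for other models.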

Theorem 3.6.3. Suppose the $m$-component vector $Z$ has the density
$|\Phi|^{-1/2} h[(z - \nu)'\Phi^{-1}(z - \nu)]$, where $w^{\frac{1}{2}m} h(w)$ has a finite positive maximum at
$w_h$ and $\Phi$ is a positive definite matrix. Let $\Omega$ be a set in the space of $(\nu, \Phi)$
such that if $(\nu, \Phi) \in \Omega$ then $(\nu, c\Phi) \in \Omega$ for all $c > 0$. Suppose that on the
basis of an observation $z$ when $h(w) = \text{const}\; e^{-\frac{1}{2}w}$ (i.e., $Z$ has a normal
distribution) the maximum likelihood estimator $(\tilde{\nu}, \tilde{\Phi}) \in \Omega$ exists and is unique
with $\tilde{\Phi}$ positive definite with probability 1. Then the maximum likelihood
estimator of $(\nu, \Phi)$ for arbitrary $h(\cdot)$ is

(40)    $\hat{\nu} = \tilde{\nu}, \qquad \hat{\Phi} = \frac{m}{w_h}\,\tilde{\Phi}$,

and the maximum of the likelihood is $|\hat{\Phi}|^{-1/2} h(w_h)$ [Anderson, Fang, and Hsu
(1986)].

Proof. Let $\Psi = |\Phi|^{-1/m}\Phi$ and

(41)    $d = (z - \nu)'\Phi^{-1}(z - \nu) = \frac{(z - \nu)'\Psi^{-1}(z - \nu)}{|\Phi|^{1/m}}$.

Then $(\nu, \Psi) \in \Omega$ and $|\Psi| = 1$. The likelihood is

(42)    $|\Phi|^{-1/2} h(d) = [(z - \nu)'\Psi^{-1}(z - \nu)]^{-\frac{1}{2}m}\, d^{\frac{1}{2}m} h(d)$.

Under normality $h(d) = (2\pi)^{-\frac{1}{2}m} e^{-\frac{1}{2}d}$, and the maximum of (42) is attained
at $\nu = \tilde{\nu}$, $\Psi = \tilde{\Psi} = |\tilde{\Phi}|^{-1/m}\tilde{\Phi}$, and $d = m$. For arbitrary $h(\cdot)$ the maximum
of (42) is attained at $\nu = \tilde{\nu}$, $\Psi = \tilde{\Psi}$, and $d = w_h$. Then the maximum
likelihood estimator of $\Phi$ is

(43)    $\hat{\Phi} = |\hat{\Phi}|^{1/m}\,\tilde{\Psi} = \frac{(z - \tilde{\nu})'\tilde{\Psi}^{-1}(z - \tilde{\nu})}{w_h}\,\tilde{\Psi}$.

Then (40) follows from (43) by use of (41). ■
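The scale factor $m/w_h$ in (40) requires only the maximizer of $w^{\frac{1}{2}m} h(w)$, which can be located numerically when no closed form is at hand. The sketch below uses only the Python standard library and a golden-section search; the function names `argmax_w` and `mle_scale_factor`, the argument `log_h` (the logarithm of $h$), and the search interval are all assumptions made for illustration, and the search presumes the function is unimodal on that interval.

```python
import math

def argmax_w(log_h, m, lo=1e-6, hi=1e3, iters=200):
    """Golden-section search for the maximizer w_h of w^{m/2} h(w) on [lo, hi]."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0

    def objective(w):
        # Logarithm of w^{m/2} h(w); maximizing the log gives the same w_h.
        return 0.5 * m * math.log(w) + log_h(w)

    a, b = lo, hi
    for _ in range(iters):
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if objective(c) < objective(d):
            a = c          # the maximum lies in [c, b]
        else:
            b = d          # the maximum lies in [a, d]
    return 0.5 * (a + b)

def mle_scale_factor(log_h, m):
    """Factor m / w_h multiplying the normal-theory estimator in (40)."""
    return m / argmax_w(log_h, m)
```

For the normal case $\log h(w) = -\frac{1}{2}w$ up to a constant, the maximizer is $w_h = m$ and the factor is 1, in agreement with the proof.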

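Theorem 3.6.4. Let $X$ ($N \times p$) have the density (28), where $w^{\frac{1}{2}Np} g(w)$ has
a finite positive maximum at $w_g$. Then the maximum likelihood estimators of $\mu$
and $\Lambda$ are

(44)    $\hat{\mu} = \bar{x}, \qquad \hat{\Lambda} = \frac{p}{w_g}\,A$,

where $A = \sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})'$.

Corollary 3.6.1. Let $X$ ($N \times p$) have the density (28). Then the maximum
likelihood estimators of $\mu$, $(\lambda_{11}, \ldots, \lambda_{pp})$, and $\rho_{ij}$, $i, j = 1, \ldots, p$, are $\bar{x}$,
$(p/w_g)(a_{11}, \ldots, a_{pp})$, and $a_{ij}/\sqrt{a_{ii}a_{jj}}$, $i, j = 1, \ldots, p$.

Proof. Corollary 3.6.1 follows from Theorem 3.6.3 and Corollary 3.2.1. ■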

Theorem 3.6.5. Let $f(X)$ be a vector-valued function of $X$ ($N \times p$) such
that

(45)    $f(X + \varepsilon_N \nu') = f(X)$

for all $\nu$ and

(46)    $f(cX) = f(X)$

for all $c$. Then the distribution of $f(X)$ where $X$ has an arbitrary density (28) is
the same as its distribution where $X$ has the normal density (28).

Proof. Substitution of the representation (27) into $f(X)$ gives

(47)    $f(X) = f(YC' + \varepsilon_N \mu') = f(YC')$

by (45). Let $f(X) = h(\mathrm{vec}\,X)$. Then by (46), $h(cX) = h(X)$ and

(48)    $f(YC') = h[(C \otimes I_N)\,\mathrm{vec}\,Y] \stackrel{d}{=} h[R\,(C \otimes I_N)\,\mathrm{vec}\,U]$
        $= h[(C \otimes I_N)\,\mathrm{vec}\,U]$. ■

Any statistic satisfying (45) and (46) has the same distribution for all $g(\cdot)$.
Hence, if its distribution is known for the normal case, the distribution is
valid for all elliptically contoured distributions.
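The matrix of sample correlation coefficients is one statistic that satisfies both (45) and (46), which can be seen directly on data. The sketch below assumes NumPy; the function name `sample_correlations` and the particular shift and scale values are hypothetical choices for the demonstration.

```python
import numpy as np

def sample_correlations(X):
    """Matrix of sample correlations a_ij / sqrt(a_ii a_jj) from the rows of X."""
    X = np.asarray(X, dtype=float)
    D = X - X.mean(axis=0)
    A = D.T @ D
    s = np.sqrt(np.diag(A))
    return A / np.outer(s, s)

rng = np.random.default_rng(0)
X = rng.standard_normal((12, 3))
shift = np.array([3.0, -1.0, 0.5])            # plays the role of nu in (45)

R0 = sample_correlations(X)
R_shift = sample_correlations(X + shift)      # X + eps_N nu'
R_scale = sample_correlations(2.5 * X)        # cX with c = 2.5
```

All three matrices coincide up to rounding, so by Theorem 3.6.5 the normal-theory distribution of the sample correlations carries over to every density of the form (28).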

Any function of the sufficient set of statistics that is translation-invariant,
that is, that satisfies (45), is a function of $S$. Thus inference concerning $\Sigma$ can
be based on $S$.

Corollary 3.6.2. Let $f(X)$ be a vector-valued function of $X$ ($N \times p$) such
that (46) holds for all $c$. Then the distribution of $f(X)$ where $X$ has arbitrary
density (28) with $\mu = 0$ is the same as its distribution where $X$ has normal density
(28) with $\mu = 0$.

Fang and Zhang (1990) give this corollary as Theorem 2.5.8.


PROBLEMS 

3.1. (Sec. 3.2) Find $\hat{\mu}$, $\hat{\Sigma}$, and $(\hat{\rho}_{ij})$ for the data given in Table 3.3, taken from
Frets (1921).

3.2. (Sec. 3.2) Verify the numerical results of (21).

3.3. (Sec. 3.2) Compute $\hat{\mu}$, $\hat{\Sigma}$, $S$, and $\hat{\rho}$ for the following pairs of observations:
(34,55), (12,29), (33,75), (44,89), (89,62), (59,69), (50,41), (88,67). Plot the
observations.

3.4. (Sec. 3.2) Use the facts that $|C^*| = \prod \lambda_i$, $\mathrm{tr}\,C^* = \sum \lambda_i$, and $C^* = I$ if $\lambda_1 = \cdots
= \lambda_p = 1$, where $\lambda_1, \ldots, \lambda_p$ are the characteristic roots of $C^*$, to prove Lemma
3.2.2. [Hint: Use $f$ as given in (12).]



† These data, used in examples in the first edition of this book, came from Rao
(1952), p. 245. Izenman (1980) has indicated some entries were apparently
incorrectly copied from Frets (1921) and corrected them (p. 579).


3.5. (Sec. 3.2) Let $x_1$ be the body weight (in kilograms) of a cat and $x_2$ the heart
weight (in grams). [Data from Fisher (1947b).]

(a) In a sample of 47 female cats the relevant data are 


a 1432.5 r 


x = 265.13 1029.62 

°~ 1029.62 4064.71 
Table 3.3.† Head Lengths and Breadths of Brothers

Head Length,    Head Breadth,   Head Length,    Head Breadth,
First Son,      First Son,      Second Son,     Second Son,
$x_1$           $x_2$           $x_3$           $x_4$

191             155             179             145
195             149             201             152
181             148             185             149
183             153             188             149
176             144             171             142

Find $\hat{\mu}$, $\hat{\Sigma}$, $S$, and $\hat{\rho}$.
Table 3.4. Four Measurements on Three Species of Iris (in centimeters)

Iris setosa                     Iris versicolor                 Iris virginica

Sepal  Sepal  Petal  Petal      Sepal  Sepal  Petal  Petal      Sepal  Sepal  Petal  Petal
length width  length width      length width  length width      length width  length width


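(b) In a sample of 97 male cats the relevant data are

$\sum x_\alpha = \begin{pmatrix} 281.3 \\ 1098.3 \end{pmatrix}, \qquad \sum x_\alpha x_\alpha' = \begin{pmatrix} 836.75 & 3275.55 \\ 3275.55 & 13056.17 \end{pmatrix}$.

Find $\hat{\mu}$, $\hat{\Sigma}$, $S$, and $\hat{\rho}$.

3.6. Find $\hat{\mu}$, $\hat{\Sigma}$, and $(\hat{\rho}_{ij})$ for Iris setosa from Table 3.4, taken from Edgar Anderson's
famous iris data [Fisher (1936)].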


3.7. (Sec. 3.2) Invariance of the sample correlation coefficient. Prove that $r_{12}$ is an
invariant characteristic of the sufficient statistics $\bar{x}$ and $S$ of a bivariate sample
under location and scale transformations ($x_{i\alpha}^* = b_i x_{i\alpha} + c_i$, $b_i > 0$, $i = 1, 2$, $\alpha =
1, \ldots, N$) and that every function of $\bar{x}$ and $S$ that is invariant is a function of
$r_{12}$. [Hint: See Theorem 2.3.2.]

3.8. (Sec. 3.2) Prove Lemma 3.2.2 by induction. [Hint: Let $H_1 = h_{11}$,

$H_i = \begin{pmatrix} H_{i-1} & h_{(i)} \\ h_{(i)}' & h_{ii} \end{pmatrix}, \qquad i = 2, \ldots, p,$

and use Problem 2.36.]
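3.9. (Sec. 3.2) Show that

$\frac{1}{N(N - 1)} \sum_{\alpha < \beta} (x_\alpha - x_\beta)(x_\alpha - x_\beta)' = \frac{1}{N - 1} \sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})'$.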

Table 3.4. (Continued)

Iris versicolor

Sepal    Sepal    Petal    Petal
length   width    length   width

5.5      2.6      4.4      1.2
6.1      3.0      4.6      1.4
5.8      2.6      4.0      1.2
5.0      2.3      3.3      1.0
5.6      2.7      4.2      1.3
5.7      3.0      4.2      1.2
5.7      2.9      4.2      1.3
6.2      2.9      4.3      1.3
5.1      2.5      3.0      1.1
5.7      2.8      4.1      1.3

(Note: When p = 1, the left-hand side is the average squared differences of the 
observations.) 
3.10. (Sec. 3.2) Estimation of $\Sigma$ when $\mu$ is known. Show that if $x_1, \ldots, x_N$ constitute
a sample from $N(\mu, \Sigma)$ and $\mu$ is known, then $(1/N)\sum_{\alpha=1}^{N} (x_\alpha - \mu)(x_\alpha - \mu)'$ is
the maximum likelihood estimator of $\Sigma$.

3.11. (Sec. 3.2) Estimation of parameters of a complex normal distribution. Let
$z_1, \ldots, z_N$ be $N$ observations from the complex normal distribution with mean
$\theta$ and covariance matrix $P$. (See Problem 2.64.)

(a) Show that the maximum likelihood estimators of $\theta$ and $P$ are

$\hat{\theta} = \bar{z} = \frac{1}{N} \sum_{\alpha=1}^{N} z_\alpha, \qquad \hat{P} = \frac{1}{N} \sum_{\alpha=1}^{N} (z_\alpha - \bar{z})(z_\alpha - \bar{z})^*$.

(b) Show that $\bar{z}$ has the complex normal distribution with mean $\theta$ and covariance
matrix $(1/N)P$.

(c) Show that $\bar{z}$ and $\hat{P}$ are independently distributed and that $N\hat{P}$ has the
distribution of $\sum_{\alpha=1}^{n} W_\alpha W_\alpha^*$, where $W_1, \ldots, W_n$ are independently distributed,
each according to the complex normal distribution with mean 0 and covariance
matrix $P$, and $n = N - 1$.

3.12. (Sec. 3.2) Prove Lemma 3.2.2 by using Lemma 3.2.3 and showing $N \log|C| -
\mathrm{tr}\,CD$ has a maximum at $C = ND^{-1}$ by setting the derivatives of this function
with respect to the elements of $C = \Sigma^{-1}$ equal to 0. Show that the function of $C$
tends to $-\infty$ as $C$ tends to a singular matrix or as one or more elements of $C$
tend to $\infty$ and/or $-\infty$ (nondiagonal elements); for this latter, the equivalent
of (13) can be used.

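3.13. (Sec. 3.3) Let $X_\alpha$ be distributed according to $N(\gamma c_\alpha, \Sigma)$, $\alpha = 1, \ldots, N$, where
$\sum c_\alpha^2 > 0$. Show that the distribution of $g = (1/\sum c_\alpha^2)\sum c_\alpha X_\alpha$ is $N[\gamma, (1/\sum c_\alpha^2)\Sigma]$.
Show that $E = \sum_\alpha (X_\alpha - gc_\alpha)(X_\alpha - gc_\alpha)'$ is independently distributed as
$\sum_{\alpha=1}^{N-1} Z_\alpha Z_\alpha'$, where $Z_1, \ldots, Z_N$ are independent, each with distribution $N(0, \Sigma)$.
[Hint: Let $Z_\alpha = \sum_\beta b_{\alpha\beta} X_\beta$, where $b_{N\beta} = c_\beta/\sqrt{\sum c_\alpha^2}$ and $B$ is orthogonal.]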

3.14. (Sec. 3.3) Prove that the power of the test in (19) is a function only of $p$ and
$[N_1N_2/(N_1 + N_2)](\mu^{(1)} - \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})$, given $\alpha$.

3.15. (Sec. 3.3) Efficiency of the mean. Prove that $\bar{x}$ is efficient for estimating $\mu$.

3.16. (Sec. 3.3) Prove that $\bar{x}$ and $S$ have efficiency $[(N - 1)/N]^{p(p+1)/2}$ for estimating
$\mu$ and $\Sigma$.

3.17. (Sec. 3.2) Prove that $\Pr\{|A| = 0\} = 0$ for $A$ defined by (4) when $N > p$. [Hint:
Argue that if $Z_p^* = (Z_1, \ldots, Z_p)$, then $|Z_p^*| \ne 0$ implies $A = Z_p^* Z_p^{*\prime} +
\sum_{\alpha=p+1}^{N-1} Z_\alpha Z_\alpha'$ is positive definite. Prove $\Pr\{|Z_j^*| = Z_{jj}|Z_{j-1}^*| + \sum_{i=1}^{j-1} Z_{ij}\,\mathrm{cof}(Z_{ij})
= 0\} = 0$ by induction, $j = 2, \ldots, p$.]
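3.18. (Sec. 3.4) Prove

$I - \Phi(\Phi + \Sigma)^{-1} = \Sigma(\Phi + \Sigma)^{-1}$,
$\Phi - \Phi(\Phi + \Sigma)^{-1}\Phi = (\Phi^{-1} + \Sigma^{-1})^{-1}$.

3.19. (Sec. 3.4) Prove $(1/N)\sum_{\alpha=1}^{N} (x_\alpha - \mu)(x_\alpha - \mu)'$ is an unbiased estimator of $\Sigma$
when $\mu$ is known.

3.20. (Sec. 3.4) Show that

$\Phi\left(\Phi + \tfrac{1}{N}\Sigma\right)^{-1}\bar{x} + \tfrac{1}{N}\Sigma\left(\Phi + \tfrac{1}{N}\Sigma\right)^{-1}\nu = (\Phi^{-1} + N\Sigma^{-1})^{-1}(N\Sigma^{-1}\bar{x} + \Phi^{-1}\nu)$.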

3.21. (Sec. 3.5) Demonstrate Lemma 3.5.1 using integration by parts. 

3.22. (Sec. 3.5) Show that 

r CO -00 1 . ,| -OC 1 . 

人 ( f(y)(x- d ^^^ e ~' lLx ~ in \ (ixd y"j t) I/Xj)! " sv dy ' 

3.23. Let $Z(k) = (Z_{ij}(k))$, where $i = 1, \ldots, p$, $j = 1, \ldots, q$ and $k = 1, 2, \ldots$, be a
sequence of random matrices. Let one norm of a matrix $A$ be $N_1(A) =
\max_{i,j} \mathrm{mod}(a_{ij})$, and another be $N_2(A) = \sum_{i,j} a_{ij}^2 = \mathrm{tr}\,AA'$. Some alternative
ways of defining stochastic convergence of $Z(k)$ to $B$ ($p \times q$) are

(a) $N_1(Z(k) - B)$ converges stochastically to 0,

(b) $N_2(Z(k) - B)$ converges stochastically to 0, and

(c) $Z_{ij}(k) - b_{ij}$ converges stochastically to 0, $i = 1, \ldots, p$, $j = 1, \ldots, q$.

Prove that these three definitions are equivalent. Note that the definition of
$X(k)$ converging stochastically to $a$ is that for every arbitrary positive $\delta$ and
$\varepsilon$, we can find $K$ large enough so that for $k > K$

$\Pr\{|X(k) - a| < \delta\} > 1 - \varepsilon$.

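3.24. (Sec. 3.2) Covariance matrices with linear structure [Anderson (1969)]. Let

(i)    $\Sigma = \sum_{g=0}^{q} \sigma_g G_g$,

where $G_0, \ldots, G_q$ are given symmetric matrices such that there exists at least
one $(q + 1)$-tuplet $\sigma_0, \sigma_1, \ldots, \sigma_q$ such that (i) is positive definite. Show that the
likelihood equations based on $N$ observations are

(ii)    $-\frac{N}{2}\,\mathrm{tr}\,\hat{\Sigma}^{-1}G_g + \frac{1}{2}\,\mathrm{tr}\,A\hat{\Sigma}^{-1}G_g\hat{\Sigma}^{-1} = 0, \qquad g = 0, 1, \ldots, q.$

Show that an iterative (scoring) method can be based on

(iii)    $\sum_{h=0}^{q} \mathrm{tr}\,\hat{\Sigma}_{i-1}^{-1}G_g\hat{\Sigma}_{i-1}^{-1}G_h\,\hat{\sigma}_h^{(i)} = \frac{1}{N}\,\mathrm{tr}\,\hat{\Sigma}_{i-1}^{-1}G_g\hat{\Sigma}_{i-1}^{-1}A, \qquad g = 0, 1, \ldots, q,$

where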



CHAPTER 4 


The Distributions and Uses of 
Sample Correlation Coefficients 


4.1. INTRODUCTION


In Chapter 2, in which the multivariate normal distribution was introduced, it
was shown that a measure of dependence between two normal variates is the
correlation coefficient $\rho_{ij} = \sigma_{ij}/\sqrt{\sigma_{ii}\sigma_{jj}}$. In a conditional distribution of
$X_1, \ldots, X_q$ given $X_{q+1} = x_{q+1}, \ldots, X_p = x_p$, the partial correlation $\rho_{ij \cdot q+1, \ldots, p}$
measures the dependence between $X_i$ and $X_j$. The third kind of correlation
discussed was the multiple correlation, which measures the relationship
between one variate and a set of others. In this chapter we treat the sample
equivalents of these quantities; they are point estimates of the population
quantities. The distributions of the sample correlations are found. Tests of
hypotheses and confidence intervals are developed.

In the cases of joint normal distributions these correlation coefficients are 
the natural measures of dependence. In the population they are the only 
parameters except for location (means) and scale (standard deviations) pa¬ 
rameters. In the sample the correlation coefficients are derived as the 
reasonable estimates of the population correlations. Since the sample means
and standard deviations are location and scale estimates, the sample correla¬ 
tions (that is, the standardized sample second moments) give all possible 
information about the population correlations. The sample correlations are 
the functions of the sufficient statistics that are invariant with respect to 
location and scale transformations; the population correlations are the
functions of the parameters that are invariant with respect to these
transformations.


An Introduction to Multivariate Statistical Analysis, Third Edition. By T. W. Anderson
ISBN 0471-36091-0 Copyright © 2003 John Wiley & Sons, Inc. 


In regression theory or least squares, one variable is considered random or
dependent, and the others fixed or independent. In correlation theory we
consider several variables as random and treat them symmetrically. If we
start with a joint normal distribution and hold all variables fixed except one,
we obtain the least squares model because the expected value of the random
variable in the conditional distribution is a linear function of the variables
held fixed. The sample regression coefficients obtained in least squares are
functions of the sample variances and correlations.

In testing independence we shall see that we arrive at the same tests in
either case (i.e., in the joint normal distribution or in the conditional
distribution of least squares). The probability theory under the null hypothesis
is the same. The distribution of the test criterion when the null hypothesis
is not true differs in the two cases. If all variables may be considered random,
one uses correlation theory as given here; if only one variable is random,
one uses least squares theory (which is considered in some generality in
Chapter 8).

In Section 4.2 we derive the distribution of the sample correlation coeffi¬ 
cient, first when the corresponding population correlation coefficient is 0 (the 
two normal variables being independent) and then for any value of the 
population coefficient. The Fisher z-transform yields a useful approximate 
normal distribution. Exact and approximate confidence intervals are developed.
In Section 4.3 we carry out the same program for partial correlations,
that is, correlations in conditional normal distributions. In Section 4.4 the
distributions and other properties of the sample multiple correlation coeffi¬ 
cient are studied. In Section 4.5 the asymptotic distributions of these cor¬ 
relations are derived for elliptically contoured distributions. A stochastic 
representation for a class of such distributions is found. 

4.2. CORRELATION COEFFICIENT OF A BIVARIATE SAMPLE 

4.2.1. The Distribution When the Population Correlation Coefficient Is Zero;
Tests of the Hypothesis of Lack of Correlation

In Section 3.2 it was shown that if one has a sample (of $p$-component vectors)
$x_1, \ldots, x_N$ from a normal distribution, the maximum likelihood estimator of
the correlation between $X_i$ and $X_j$ (two components of the random vector
$X$) is

(1)    $r_{ij} = \frac{\sum_{\alpha=1}^{N} (x_{i\alpha} - \bar{x}_i)(x_{j\alpha} - \bar{x}_j)}{\sqrt{\sum_{\alpha=1}^{N} (x_{i\alpha} - \bar{x}_i)^2}\,\sqrt{\sum_{\alpha=1}^{N} (x_{j\alpha} - \bar{x}_j)^2}}$,

where $x_{i\alpha}$ is the $i$th component of $x_\alpha$ and

(2)    $\bar{x}_i = \frac{1}{N} \sum_{\alpha=1}^{N} x_{i\alpha}$.

In this section we shall find the distribution of $r_{ij}$ when the population
correlation between $X_i$ and $X_j$ is zero, and we shall see how to use the
sample correlation coefficient to test the hypothesis that the population
coefficient is zero.

For convenience we shall treat $r_{12}$; the same theory holds for each $r_{ij}$.
Since $r_{12}$ depends only on the first two coordinates of each $x_\alpha$, to find the
distribution of $r_{12}$ we need only consider the joint distribution of $(x_{11}, x_{21})$,
$(x_{12}, x_{22}), \ldots, (x_{1N}, x_{2N})$. We can reformulate the problems to be considered
here, therefore, in terms of a bivariate normal distribution. Let $x_1^*, \ldots, x_N^*$ be
observation vectors from

(3)    $N\left[ \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \sigma_1\sigma_2\rho \\ \sigma_2\sigma_1\rho & \sigma_2^2 \end{pmatrix} \right]$.

We shall consider

(4)    $r = \frac{a_{12}}{\sqrt{a_{11}}\sqrt{a_{22}}}$,

where

(5)    $a_{ij} = \sum_{\alpha=1}^{N} (x_{i\alpha} - \bar{x}_i)(x_{j\alpha} - \bar{x}_j), \qquad i, j = 1, 2,$

and $\bar{x}_i$ is defined by (2), $x_{i\alpha}$ being the $i$th component of $x_\alpha^*$.

From Section 3.3 we see that $a_{11}$, $a_{12}$, and $a_{22}$ are distributed like

(6)    $a_{ij} = \sum_{\alpha=1}^{n} z_{i\alpha} z_{j\alpha}, \qquad i, j = 1, 2,$

where $n = N - 1$, $(z_{1\alpha}, z_{2\alpha})$ is distributed according to

(7)    $N\left[ \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \sigma_1\sigma_2\rho \\ \sigma_1\sigma_2\rho & \sigma_2^2 \end{pmatrix} \right]$,

and the pairs $(z_{11}, z_{21}), \ldots, (z_{1n}, z_{2n})$ are independently distributed.
Figure 4.1


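Define the $n$-component vector $v_i = (z_{i1}, \ldots, z_{in})'$, $i = 1, 2$. These two
vectors can be represented in an $n$-dimensional space; see Figure 4.1. The
correlation coefficient is the cosine of the angle, say $\theta$, between $v_1$ and $v_2$.
(See Section 3.2.) To find the distribution of $\cos\theta$ we shall first find the
distribution of $\cot\theta$. As shown in Section 3.2, if we let $b = v_2'v_1/v_1'v_1$, then
$v_2 - bv_1$ is orthogonal to $v_1$ and

(8)    $\cot\theta = \frac{b\|v_1\|}{\|v_2 - bv_1\|}$.

If $v_1$ is fixed, we can rotate coordinate axes so that the first coordinate axis
lies along $v_1$. Then $bv_1$ has only the first coordinate different from zero, and
$v_2 - bv_1$ has this first coordinate equal to zero. We shall show that $\cot\theta$ is
proportional to a $t$-variable when $\rho = 0$.

We use the following lemma.

Lemma 4.2.1. If $Y_1, \ldots, Y_n$ are independently distributed, if $Y_\alpha = (Y_\alpha^{(1)\prime}, Y_\alpha^{(2)\prime})'$
has the density $f(y_\alpha)$, and if the conditional density of $Y_\alpha^{(2)}$ given $Y_\alpha^{(1)} = y_\alpha^{(1)}$ is
$f(y_\alpha^{(2)} | y_\alpha^{(1)})$, $\alpha = 1, \ldots, n$, then in the conditional distribution of $Y_1^{(2)}, \ldots, Y_n^{(2)}$
given $Y_1^{(1)} = y_1^{(1)}, \ldots, Y_n^{(1)} = y_n^{(1)}$, the random vectors $Y_1^{(2)}, \ldots, Y_n^{(2)}$ are independent
and the density of $Y_\alpha^{(2)}$ is $f(y_\alpha^{(2)} | y_\alpha^{(1)})$, $\alpha = 1, \ldots, n$.

Proof. The marginal density of $Y_1^{(1)}, \ldots, Y_n^{(1)}$ is $\prod_{\alpha=1}^{n} f_1(y_\alpha^{(1)})$, where $f_1(y_\alpha^{(1)})$
is the marginal density of $Y_\alpha^{(1)}$, and the conditional density of $Y_1^{(2)}, \ldots, Y_n^{(2)}$
given $Y_1^{(1)} = y_1^{(1)}, \ldots, Y_n^{(1)} = y_n^{(1)}$ is

(9)    $\frac{\prod_{\alpha=1}^{n} f(y_\alpha)}{\prod_{\alpha=1}^{n} f_1(y_\alpha^{(1)})} = \prod_{\alpha=1}^{n} \frac{f(y_\alpha)}{f_1(y_\alpha^{(1)})} = \prod_{\alpha=1}^{n} f(y_\alpha^{(2)} | y_\alpha^{(1)})$. ■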




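Write $V_i = (Z_{i1}, \ldots, Z_{in})'$, $i = 1, 2$, to denote random vectors. The conditional
distribution of $Z_{2\alpha}$ given $Z_{1\alpha} = z_{1\alpha}$ is $N(\beta z_{1\alpha}, \sigma^2)$, where $\beta = \rho\sigma_2/\sigma_1$
and $\sigma^2 = \sigma_2^2(1 - \rho^2)$. (See Section 2.5.) The density of $V_2$ given $V_1 = v_1$ is
$N(\beta v_1, \sigma^2 I)$ since the $Z_{2\alpha}$ are independent. Let $b = V_2'v_1/v_1'v_1$ ($= a_{21}/a_{11}$),
so that $bv_1'(V_2 - bv_1) = 0$, and let $U = (V_2 - bv_1)'(V_2 - bv_1) = V_2'V_2 - b^2v_1'v_1$
($= a_{22} - a_{12}^2/a_{11}$). Then $\cot\theta = b\sqrt{a_{11}/U}$. The rotation of coordinate axes
involves choosing an $n \times n$ orthogonal matrix $C$ with first row $(1/c)v_1'$, where
$c^2 = v_1'v_1$.

We now apply Theorem 3.3.1 with $X_\alpha = Z_{2\alpha}$. Let $Y_\alpha = \sum_\beta c_{\alpha\beta} Z_{2\beta}$, $\alpha =
1, \ldots, n$. Then $Y_1, \ldots, Y_n$ are independently normally distributed with variance
$\sigma^2$ and means

(10)    $\mathscr{E}Y_1 = \sum_{\gamma=1}^{n} c_{1\gamma}\beta z_{1\gamma} = \frac{\beta}{c} \sum_{\gamma=1}^{n} z_{1\gamma}^2 = \beta c$,

(11)    $\mathscr{E}Y_\alpha = \sum_{\gamma=1}^{n} c_{\alpha\gamma}\beta z_{1\gamma} = \beta c \sum_{\gamma=1}^{n} c_{\alpha\gamma}c_{1\gamma} = 0, \qquad \alpha \ne 1.$

We have $b = \sum_{\alpha=1}^{n} Z_{2\alpha}z_{1\alpha}/\sum_{\alpha=1}^{n} z_{1\alpha}^2 = c\sum_{\alpha=1}^{n} Z_{2\alpha}c_{1\alpha}/c^2 = Y_1/c$ and, from
Lemma 3.3.1,

(12)    $U = \sum_{\alpha=1}^{n} Z_{2\alpha}^2 - b^2 \sum_{\alpha=1}^{n} z_{1\alpha}^2 = \sum_{\alpha=1}^{n} Y_\alpha^2 - Y_1^2$
        $= \sum_{\alpha=2}^{n} Y_\alpha^2$,

which is independent of $b$. Then $U/\sigma^2$ has a $\chi^2$-distribution with $n - 1$
degrees of freedom.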

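Lemma 4.2.2. If $(Z_{1\alpha}, Z_{2\alpha})$, $\alpha = 1, \ldots, n$, are independent, each pair with
density (7), then the conditional distributions of $b = \sum_{\alpha=1}^{n} Z_{2\alpha}Z_{1\alpha}/\sum_{\alpha=1}^{n} Z_{1\alpha}^2$
and $U/\sigma^2 = \sum_{\alpha=1}^{n} (Z_{2\alpha} - bZ_{1\alpha})^2/\sigma^2$ given $Z_{1\alpha} = z_{1\alpha}$, $\alpha = 1, \ldots, n$, are
$N(\beta, \sigma^2/c^2)$ ($c^2 = \sum_{\alpha=1}^{n} z_{1\alpha}^2$) and $\chi^2$ with $n - 1$ degrees of freedom,
respectively; and $b$ and $U$ are independent.

If $\rho = 0$, then $\beta = 0$, and $b$ is distributed conditionally according to
$N(0, \sigma^2/c^2)$, and

(13)    $\frac{cb/\sigma}{\sqrt{\dfrac{U/\sigma^2}{n - 1}}} = \frac{cb}{\sqrt{\dfrac{U}{n - 1}}}$

has a conditional $t$-distribution with $n - 1$ degrees of freedom. (See Problem
4.27.) However, this random variable is

(14)    $\sqrt{n - 1}\,\frac{\sqrt{a_{11}}\,a_{12}/a_{11}}{\sqrt{a_{22} - a_{12}^2/a_{11}}} = \sqrt{n - 1}\,\frac{a_{12}/\sqrt{a_{11}a_{22}}}{\sqrt{1 - [a_{12}^2/(a_{11}a_{22})]}} = \sqrt{n - 1}\,\frac{r}{\sqrt{1 - r^2}}$.

Thus $\sqrt{n - 1}\,r/\sqrt{1 - r^2}$ has a conditional $t$-distribution with $n - 1$ degrees of
freedom. The density of $t$ is

(15)    $\frac{\Gamma(\frac{1}{2}n)}{\sqrt{n - 1}\,\Gamma[\frac{1}{2}(n - 1)]\sqrt{\pi}} \left(1 + \frac{t^2}{n - 1}\right)^{-\frac{1}{2}n}$,

and the density of $W = r/\sqrt{1 - r^2}$ is

(16)    $\frac{\Gamma(\frac{1}{2}n)}{\Gamma[\frac{1}{2}(n - 1)]\sqrt{\pi}}\,(1 + w^2)^{-\frac{1}{2}n}$.

Since $w = r(1 - r^2)^{-\frac{1}{2}}$, we have $dw/dr = (1 - r^2)^{-\frac{3}{2}}$. Therefore the density of
$r$ is (replacing $n$ by $N - 1$)

(17)    $\frac{\Gamma[\frac{1}{2}(N - 1)]}{\Gamma[\frac{1}{2}(N - 2)]\sqrt{\pi}}\,(1 - r^2)^{\frac{1}{2}(N - 4)}$.

It should be noted that (17) is the conditional density of $r$ for $v_1$ fixed.
However, since (17) does not depend on $v_1$, it is also the marginal density
of $r$.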

Theorem 4.2.1. Let $X_1, \ldots, X_N$ be independent, each with distribution
$N(\mu, \Sigma)$. If $\rho_{ij} = 0$, the density of $r_{ij}$ defined by (1) is (17).

From (17) we see that the density is symmetric about the origin. For
$N > 4$, it has a mode at $r = 0$ and its order of contact with the $r$-axis at $\pm 1$ is
$\frac{1}{2}(N - 5)$ for $N$ odd and $\frac{1}{2}N - 3$ for $N$ even. Since the density is even, the
odd moments are zero; in particular, the mean is zero. The even moments
are found by integration (letting $x = r^2$ and using the definition of the beta
function). That $\mathscr{E}r^{2m} = \Gamma[\frac{1}{2}(N - 1)]\Gamma(m + \frac{1}{2})/\{\sqrt{\pi}\,\Gamma[\frac{1}{2}(N - 1) + m]\}$ and in
particular that the variance is $1/(N - 1)$ may be verified by the reader.
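The normalization of (17) and the variance $1/(N - 1)$ can also be confirmed by numerical integration. The sketch below uses only the Python standard library and a midpoint rule; the function names `k_N` and `integrate`, the choice $N = 10$, and the number of steps are assumptions made for this check.

```python
import math

def k_N(r, N):
    """Density (17) of the sample correlation coefficient when rho = 0."""
    const = math.gamma(0.5 * (N - 1)) / (math.gamma(0.5 * (N - 2)) * math.sqrt(math.pi))
    return const * (1.0 - r * r) ** (0.5 * (N - 4))

def integrate(f, a, b, steps=20000):
    """Midpoint-rule approximation to the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

N = 10
total = integrate(lambda r: k_N(r, N), -1.0, 1.0)              # should be close to 1
mean = integrate(lambda r: r * k_N(r, N), -1.0, 1.0)           # should be close to 0
variance = integrate(lambda r: r * r * k_N(r, N), -1.0, 1.0)   # should be close to 1/(N - 1)
```

The midpoint rule never evaluates the density at $r = \pm 1$, so the same code can be tried for small $N$ where the density is unbounded at the endpoints, although the accuracy is then poorer.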

The most important use of Theorem 42,1 is to find significance points for 
testing the hypothesis that a pair of variables are not correlated. Consider the 
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hypothesis 

(18) Ji: Plj ^0 

for some particular pair 0,jX It would seem reasonable to reject this 
hypothesis if the corresponding sample correlation coefficient were very 
different from zero. Now how do we decide what we mean by “very 
different ”？ ^ 

Let us suppose we are interested in testing H against the alternative 
hypotheses p {j > 0* Then we reject H if the sample correlation coefficient r l} 
is greater than some number r 0 . The probability of rejecting H when H is 
true is 

(19) f l k N (r)dr, 

r Q 

where k N (r) is (17), the density of a correlation coefficient based on N 
observations. We choose r 0 so (19) is the desired significance level. If we test 
H against alternatives p l} <0, we reject H when r (j < -r 0 . 

Now suppose we are interested in alternatives p tj ^ 0; that is, p ]f may be 
either positive or negative. Then we reject the hypothesis H if r (j > r, or 
r" < 一 r r The probability of rejection when H is true Is 

(20) / ' N (r)dr+ f 、 k N (r)dr. 


The number r { is chosen so that (20) is the desired significance level 

The significance points $r_1$ are given in many books, including Table VI of
Fisher and Yates (1942); the index $n$ in Table VI is equal to our $N - 2$. Since
$\sqrt{N-2}\,r/\sqrt{1-r^2}$ has the $t$-distribution with $N - 2$ degrees of freedom,
$t$-tables can also be used. Against alternatives $\rho_{ij} \ne 0$, reject $H$ if

(21) $\sqrt{N-2}\,\dfrac{|r_{ij}|}{\sqrt{1-r_{ij}^2}} > t_{N-2}(\alpha),$

where $t_{N-2}(\alpha)$ is the two-tailed significance point of the $t$-statistic with $N - 2$
degrees of freedom for significance level $\alpha$. Against alternatives $\rho_{ij} > 0$,
reject $H$ if

(22) $\sqrt{N-2}\,\dfrac{r_{ij}}{\sqrt{1-r_{ij}^2}} > t_{N-2}(2\alpha).$

From (13) and (14) we see that $\sqrt{N-2}\,r/\sqrt{1-r^2}$ is the proper statistic for
testing the hypothesis that the regression of $V_2$ on $v_1$ is zero. In terms of the
original observations $\{x_{i\alpha}\}$, we have

(23) $\sqrt{N-2}\,\dfrac{r}{\sqrt{1-r^2}} = \dfrac{b\,\sqrt{\sum_{\alpha}(x_{1\alpha}-\bar{x}_1)^2}}{\sqrt{\sum_{\alpha}[x_{2\alpha}-\bar{x}_2-b(x_{1\alpha}-\bar{x}_1)]^2/(N-2)}},$

where $b = \sum_{\alpha}(x_{2\alpha}-\bar{x}_2)(x_{1\alpha}-\bar{x}_1)/\sum_{\alpha}(x_{1\alpha}-\bar{x}_1)^2$ is the least squares
regression coefficient of $x_{2\alpha}$ on $x_{1\alpha}$. It is seen that the test of $\rho_{12} = 0$ is
equivalent to the test that the regression of $x_2$ on $x_1$ is zero (i.e., that
$\rho_{12}\sigma_2/\sigma_1 = 0$).

To illustrate this procedure we consider the example given in Section 3.2.
Let us test the null hypothesis that the effects of the two drugs are uncorrelated
against the alternative that they are positively correlated. We shall use
the 5% level of significance. For $N = 10$, the 5% significance point ($r_0$) is
0.5494. Our observed correlation coefficient of 0.7952 is significant; we reject
the hypothesis that the effects of the two drugs are independent.
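
The arithmetic of this example can be reproduced from the $t$ form of the test. The sketch below is illustrative only: the function names are assumptions, and the constant `T_CRIT` is an assumed table value (the upper 5% point of $t$ with 8 degrees of freedom), not something computed here.

```python
import math

def t_statistic(r: float, N: int) -> float:
    """sqrt(N-2) * r / sqrt(1 - r^2); t-distributed with N-2 d.f. when rho = 0."""
    return math.sqrt(N - 2) * r / math.sqrt(1.0 - r * r)

def r_from_t(t: float, N: int) -> float:
    """Invert the statistic: the value of r that corresponds to a given t."""
    return t / math.sqrt(t * t + (N - 2))

N, r = 10, 0.7952
T_CRIT = 1.8595          # assumed table value: upper 5% point of t with 8 d.f.
t_obs = t_statistic(r, N)
r0 = r_from_t(T_CRIT, N)
print(f"t = {t_obs:.3f}, r0 = {r0:.4f}, reject H: {t_obs > T_CRIT}")
```

Converting the tabulated $t$ point back to the correlation scale gives a value close to the significance point 0.5494 quoted above; small differences in the last digit come from rounding of the table entry.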


4.2.2. The Distribution When the Population Correlation Coefficient Is
Nonzero; Tests of Hypotheses and Confidence Intervals


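To find the distribution of the sample correlation coefficient when the
population coefficient is different from zero, we shall first derive the joint
density of $a_{11}$, $a_{12}$, and $a_{22}$. In Section 4.2.1 we saw that, conditional on $v_1$
held fixed, the random variables $b = a_{12}/a_{11}$ and $U/\sigma^2 = (a_{22} - a_{12}^2/a_{11})/\sigma^2$
are distributed independently according to $N(\beta, \sigma^2/c^2)$ and the $\chi^2$-distribution
with $n - 1$ degrees of freedom, respectively. Denoting the density of the
$\chi^2$-distribution by $g_{n-1}(u)$, we write the conditional density of $b$ and $U$ as
$n(b \mid \beta, \sigma^2/a_{11})\,g_{n-1}(u/\sigma^2)/\sigma^2$. The joint density of $V_1$, $b$, and $U$ is
$n(v_1 \mid 0, \sigma_1^2 I)\,n(b \mid \beta, \sigma^2/a_{11})\,g_{n-1}(u/\sigma^2)/\sigma^2$. The marginal density of
$V_1'V_1/\sigma_1^2 = a_{11}/\sigma_1^2$ is $g_n(u)$; that is, the density of $a_{11}$ is

(24) $\dfrac{1}{\sigma_1^2}\,g_n\!\left(\dfrac{a_{11}}{\sigma_1^2}\right) = \int\cdots\int_{v_1'v_1 = a_{11}} n(v_1 \mid 0, \sigma_1^2 I)\,dW,$

where $dW$ is the proper volume element.

The integration is over the sphere $v_1'v_1 = a_{11}$; thus, $dW$ is an element of
area on this sphere. (See Problem 7.1 for the use of angular coordinates in
defining $dW$.) Thus the joint density of $b$, $U$, and $a_{11}$ is

(25) $\int\cdots\int_{v_1'v_1 = a_{11}} n(b \mid \beta, \sigma^2/a_{11})\,g_{n-1}(u/\sigma^2)\,\dfrac{1}{\sigma^2}\,n(v_1 \mid 0, \sigma_1^2 I)\,dW$

$= \dfrac{g_n(a_{11}/\sigma_1^2)\,n(b \mid \beta, \sigma^2/a_{11})\,g_{n-1}(u/\sigma^2)}{\sigma_1^2\sigma^2}$

$= \dfrac{a_{11}^{\frac12 n-1}\exp\left(-\dfrac{a_{11}}{2\sigma_1^2}\right)}{(2\sigma_1^2)^{\frac12 n}\,\Gamma(\frac12 n)}\cdot\dfrac{\sqrt{a_{11}}}{\sqrt{2\pi}\,\sigma}\exp\left[-\dfrac{a_{11}(b-\beta)^2}{2\sigma^2}\right]\cdot\dfrac{u^{\frac12(n-3)}\exp\left(-\dfrac{u}{2\sigma^2}\right)}{(2\sigma^2)^{\frac12(n-1)}\,\Gamma[\frac12(n-1)]}.$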


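Now let $b = a_{12}/a_{11}$, $U = a_{22} - a_{12}^2/a_{11}$. The Jacobian is

(26) $\left|\dfrac{\partial(b, u)}{\partial(a_{12}, a_{22})}\right| = \begin{vmatrix}\dfrac{1}{a_{11}} & 0\\[4pt] -\dfrac{2a_{12}}{a_{11}} & 1\end{vmatrix} = \dfrac{1}{a_{11}}.$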


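Thus the density of $a_{11}$, $a_{12}$, and $a_{22}$ for $a_{11} > 0$, $a_{22} > 0$, and $a_{11}a_{22} - a_{12}^2 > 0$
is

(27) $\dfrac{a_{11}^{\frac12(n-3)}\left(a_{22} - \dfrac{a_{12}^2}{a_{11}}\right)^{\frac12(n-3)} e^{-\frac12 Q}}{2^n\sigma_1^n\sigma_2^n(1-\rho^2)^{\frac12 n}\sqrt{\pi}\,\Gamma(\frac12 n)\,\Gamma[\frac12(n-1)]},$

where

(28) $Q = \dfrac{a_{11}}{\sigma_1^2} + \dfrac{a_{11}}{\sigma_2^2(1-\rho^2)}\left(\dfrac{a_{12}}{a_{11}} - \rho\dfrac{\sigma_2}{\sigma_1}\right)^2 + \dfrac{a_{22} - a_{12}^2/a_{11}}{\sigma_2^2(1-\rho^2)}$

$= \dfrac{1}{1-\rho^2}\left(\dfrac{a_{11}}{\sigma_1^2} - 2\rho\dfrac{a_{12}}{\sigma_1\sigma_2} + \dfrac{a_{22}}{\sigma_2^2}\right).$

The density can be written

(29) $\dfrac{|A|^{\frac12(n-3)}\,e^{-\frac12\operatorname{tr}\Sigma^{-1}A}}{2^n|\Sigma|^{\frac12 n}\sqrt{\pi}\,\Gamma(\frac12 n)\,\Gamma[\frac12(n-1)]}$

for $A$ positive definite, and 0 otherwise. This is a special case of the Wishart
density derived in Chapter 7.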

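We want to find the density of

(30) $r = \dfrac{a_{12}}{\sqrt{a_{11}a_{22}}} = \dfrac{a_{12}/(\sigma_1\sigma_2)}{\sqrt{(a_{11}/\sigma_1^2)(a_{22}/\sigma_2^2)}} = \dfrac{a_{12}^*}{\sqrt{a_{11}^*a_{22}^*}},$

where $a_{11}^* = a_{11}/\sigma_1^2$, $a_{22}^* = a_{22}/\sigma_2^2$, and $a_{12}^* = a_{12}/(\sigma_1\sigma_2)$. The transformation
is equivalent to setting $\sigma_1 = \sigma_2 = 1$. Then the density of $a_{11}$, $a_{22}$, and
$r = a_{12}/\sqrt{a_{11}a_{22}}$ ($da_{12} = dr\,\sqrt{a_{11}a_{22}}$) is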


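(31) $\dfrac{a_{11}^{\frac12 n-1}\,a_{22}^{\frac12 n-1}\,(1-r^2)^{\frac12(n-3)}\,e^{-\frac12 Q}}{2^n(1-\rho^2)^{\frac12 n}\sqrt{\pi}\,\Gamma(\frac12 n)\,\Gamma[\frac12(n-1)]},$

where

(32) $Q = \dfrac{a_{11} - 2\rho r\sqrt{a_{11}}\sqrt{a_{22}} + a_{22}}{1-\rho^2}.$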


To find the density of $r$, we must integrate (31) with respect to $a_{11}$ and $a_{22}$
over the range 0 to $\infty$. There are various ways of carrying out the integration,
which result in different expressions for the density. The method we shall
indicate here is straightforward. We expand part of the exponential:

(33) $\exp\left[\dfrac{\rho r\sqrt{a_{11}}\sqrt{a_{22}}}{1-\rho^2}\right] = \sum_{\alpha=0}^{\infty}\dfrac{(\rho r\sqrt{a_{11}}\sqrt{a_{22}})^{\alpha}}{\alpha!\,(1-\rho^2)^{\alpha}}.$


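Then the density (31) is

(34) $\dfrac{(1-r^2)^{\frac12(n-3)}}{2^n(1-\rho^2)^{\frac12 n}\sqrt{\pi}\,\Gamma(\frac12 n)\,\Gamma[\frac12(n-1)]}\sum_{\alpha=0}^{\infty}\dfrac{(\rho r)^{\alpha}}{\alpha!\,(1-\rho^2)^{\alpha}}\left\{\exp\left[-\dfrac{a_{11}}{2(1-\rho^2)}\right]a_{11}^{(n+\alpha)/2-1}\right\}\left\{\exp\left[-\dfrac{a_{22}}{2(1-\rho^2)}\right]a_{22}^{(n+\alpha)/2-1}\right\}.$

Since

(35) $\int_0^{\infty} a^{\frac12(n+\alpha)-1}\exp\left[-\dfrac{a}{2(1-\rho^2)}\right]da = \Gamma[\tfrac12(n+\alpha)]\,[2(1-\rho^2)]^{\frac12(n+\alpha)},$

the integral of (34) (term-by-term integration is permissible) is

(36) $\dfrac{(1-r^2)^{\frac12(n-3)}}{2^n(1-\rho^2)^{\frac12 n}\sqrt{\pi}\,\Gamma(\frac12 n)\,\Gamma[\frac12(n-1)]}\sum_{\alpha=0}^{\infty}\dfrac{(\rho r)^{\alpha}}{\alpha!\,(1-\rho^2)^{\alpha}}\,\Gamma^2[\tfrac12(n+\alpha)]\,2^{n+\alpha}(1-\rho^2)^{n+\alpha}$

$= \dfrac{(1-\rho^2)^{\frac12 n}(1-r^2)^{\frac12(n-3)}}{\sqrt{\pi}\,\Gamma(\frac12 n)\,\Gamma[\frac12(n-1)]}\sum_{\alpha=0}^{\infty}\dfrac{(2\rho r)^{\alpha}}{\alpha!}\,\Gamma^2[\tfrac12(n+\alpha)].$

The duplication formula for the gamma function is

(37) $\Gamma(2z) = \dfrac{2^{2z-1}\,\Gamma(z)\,\Gamma(z+\frac12)}{\sqrt{\pi}}.$

It can be used to modify the constant in (36).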

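Theorem 4.2.2. The correlation coefficient in a sample of $N$ from a bivariate
normal distribution with correlation $\rho$ is distributed with density

(38) $\dfrac{2^{n-2}(1-\rho^2)^{\frac12 n}(1-r^2)^{\frac12(n-3)}}{(n-2)!\,\pi}\sum_{\alpha=0}^{\infty}\dfrac{(2\rho r)^{\alpha}}{\alpha!}\,\Gamma^2[\tfrac12(n+\alpha)], \qquad -1 \le r \le 1,$

where $n = N - 1$.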
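
The series (38) is easy to evaluate numerically, which gives a quick check on the constant. The sketch below is an illustration under stated assumptions, not the book's procedure: the names `density_series` and `integrate`, the term cap, and the stopping tolerance are all choices made here. It sums the series on the log scale and integrates the result over $(-1, 1)$ by the midpoint rule.

```python
import math

def density_series(r: float, rho: float, N: int, max_terms: int = 5000) -> float:
    """Density (38) of r for sample size N (N >= 3) and population correlation rho."""
    n = N - 1
    if not (-1.0 < r < 1.0) or not (-1.0 < rho < 1.0):
        raise ValueError("need -1 < r < 1 and -1 < rho < 1")
    # log of 2^(n-2) (1-rho^2)^(n/2) (1-r^2)^((n-3)/2) / ((n-2)! pi)
    log_c = ((n - 2) * math.log(2.0) + 0.5 * n * math.log1p(-rho * rho)
             + 0.5 * (n - 3) * math.log1p(-r * r)
             - math.lgamma(n - 1) - math.log(math.pi))
    x = 2.0 * rho * r
    total = math.exp(log_c + 2.0 * math.lgamma(0.5 * n))   # alpha = 0 term
    if x != 0.0:
        log_abs_x = math.log(abs(x))
        for a in range(1, max_terms):
            term = math.exp(log_c + a * log_abs_x - math.lgamma(a + 1)
                            + 2.0 * math.lgamma(0.5 * (n + a)))
            # Terms alternate in sign when rho * r is negative.
            total += term if (x > 0 or a % 2 == 0) else -term
            if term < 1e-17 * abs(total):
                break
    return total

def integrate(f, lo: float, hi: float, steps: int = 4000) -> float:
    """Midpoint rule on [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (k + 0.5) * h) for k in range(steps))

total_mass = integrate(lambda r: density_series(r, 0.5, 10), -1.0, 1.0)
print(f"integral of (38) for N = 10, rho = 0.5: {total_mass:.6f}")
```

For $\rho = 0$ only the $\alpha = 0$ term survives, and the duplication formula (37) reduces the result to the null density $k_N(r)$ of Section 4.2.1. When $\rho r$ is negative the series alternates, so some cancellation error is to be expected for $|\rho r|$ near 1.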


The distribution of $r$ was first found by Fisher (1915). He also gave as
another form of the density,

(39) $\dfrac{(1-\rho^2)^{\frac12 n}(1-r^2)^{\frac12(n-3)}}{\pi\,(n-2)!}\;\dfrac{d^{n-1}}{dx^{n-1}}\left\{\dfrac{\cos^{-1}(-x)}{\sqrt{1-x^2}}\right\}\Bigg|_{x=r\rho}.$

See Problem 4.24.

Hotelling (1953) has made an exhaustive study of the distribution of $r$. He
has recommended the following form:

(40) $\dfrac{n-1}{\sqrt{2\pi}}\,\dfrac{\Gamma(n)}{\Gamma(n+\frac12)}\,(1-\rho^2)^{\frac12 n}(1-r^2)^{\frac12(n-3)}(1-\rho r)^{-n+\frac12}\,F\!\left(\tfrac12, \tfrac12;\, n+\tfrac12;\, \dfrac{1+\rho r}{2}\right),$

where

(41) $F(a, b; c; x) = \sum_{j=0}^{\infty}\dfrac{\Gamma(a+j)}{\Gamma(a)}\,\dfrac{\Gamma(b+j)}{\Gamma(b)}\,\dfrac{\Gamma(c)}{\Gamma(c+j)}\,\dfrac{x^j}{j!}$

is a hypergeometric function. (See Problem 4.25.) The series in (40) converges
more rapidly than the one in (38). Hotelling discusses methods of integrating
the density and also calculates moments of $r$.

The cumulative distribution of $r$,

(42) $\Pr\{r \le r^*\} = F(r^* \mid N, \rho),$

has been tabulated by David (1938) for† $\rho = 0(.1).9$, $N = 3(1)25$, 50, 100, 200,
400, and $r^* = -1(.05)1$. (David’s $n$ is our $N$.) It is clear from the density (38)
that $F(r^* \mid N, \rho) = 1 - F(-r^* \mid N, -\rho)$ because the density for $r$, $\rho$ is equal to
the density for $-r$, $-\rho$. These tables can be used for a number of statistical
procedures.

First, we consider the problem of using a sample to test the hypothesis

(43) $H: \rho = \rho_0.$

If the alternatives are $\rho > \rho_0$, we reject the hypothesis if the sample correlation
coefficient is greater than $r_0$, where $r_0$ is chosen so $1 - F(r_0 \mid N, \rho_0) = \alpha$,
the significance level. If the alternatives are $\rho < \rho_0$, we reject the hypothesis
if the sample correlation coefficient is less than $r_0'$, where $r_0'$ is chosen so
$F(r_0' \mid N, \rho_0) = \alpha$. If the alternatives are $\rho \ne \rho_0$, the region of rejection is $r > r_1$
and $r < r_1'$, where $r_1$ and $r_1'$ are chosen so $[1 - F(r_1 \mid N, \rho_0)] + F(r_1' \mid N, \rho_0) = \alpha$.

David suggests that $r_1$ and $r_1'$ be chosen so $[1 - F(r_1 \mid N, \rho_0)] = F(r_1' \mid N, \rho_0) = \frac12\alpha$.
She has shown (1937) that for $N > 10$, $|\rho| < 0.8$ this critical region is
nearly the region of an unbiased test of $H$, that is, a test whose power
function has its minimum at $\rho_0$.

It should be pointed out that any test based on $r$ is invariant under
transformations of location and scale, that is, $x_{i\alpha}^* = b_i x_{i\alpha} + c_i$, $b_i > 0$, $i = 1, 2$,
$\alpha = 1, \ldots, N$; and $r$ is essentially the only invariant of the sufficient statistics
(Problem 3.7). The above procedure for testing $H: \rho = \rho_0$ against alternatives
$\rho > \rho_0$ is uniformly most powerful among all invariant tests. (See
Problems 4.16, 4.17, and 4.18.)

†$\rho = 0(.1).9$ means $\rho = 0, 0.1, 0.2, \ldots, 0.9$.


Table 4.1. A Power Function

   ρ       Probability
 -1.0      0.0000
 -0.8      0.0000
 -0.6      0.0004
 -0.4      0.0032
 -0.2      0.0147
  0.0      0.0500
  0.2      0.1376
  0.4      0.3215
  0.6      0.6235
  0.8      0.9279
  1.0      1.0000

As an example suppose one wishes to test the hypothesis that $\rho = 0.5$
against alternatives $\rho \ne 0.5$ at the 5% level of significance using the correlation
observed in a sample of 15. In David’s tables we find (by interpolation)
that $F(0.027 \mid 15, 0.5) = 0.025$ and $F(0.805 \mid 15, 0.5) = 0.975$. Hence we reject
the hypothesis if our sample $r$ is less than 0.027 or greater than 0.805.

Secondly, we can use David’s tables to compute the power function of
a test of correlation. If the region of rejection of $H$ is $r > r_1$ and $r < r_1'$,
the power of the test is a function of the true correlation $\rho$, namely
$[1 - F(r_1 \mid N, \rho)] + F(r_1' \mid N, \rho)$; this is the probability of rejecting the null
hypothesis when the population correlation is $\rho$.

As an example consider finding the power function of the test for $\rho = 0$
considered in the preceding section. The rejection region (one-sided) is
$r > 0.5494$ at the 5% significance level. The probabilities of rejection are
given in Table 4.1. The graph of the power function is illustrated in Figure
4.2.
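
Without David's tables, the power function can be approximated by integrating the density directly. The sketch below is an illustration, not the method used to produce Table 4.1: it evaluates Hotelling's form (40) with a plain series for the hypergeometric function (41) and applies a midpoint rule over the rejection region. The function names, the step count, and the series tolerance are assumptions made here.

```python
import math

def hyp2f1(a: float, b: float, c: float, x: float, tol: float = 1e-15) -> float:
    """Hypergeometric series (41), summed term by term; assumes 0 <= x < 1."""
    term, total, j = 1.0, 1.0, 0
    while abs(term) > tol * abs(total) and j < 100000:
        term *= (a + j) * (b + j) / ((c + j) * (j + 1)) * x
        total += term
        j += 1
    return total

def density_hotelling(r: float, rho: float, N: int) -> float:
    """Density (40) of r with n = N - 1; needs N >= 3, |r| < 1, |rho| < 1."""
    n = N - 1
    log_c = (math.log(n - 1) - 0.5 * math.log(2.0 * math.pi)
             + math.lgamma(n) - math.lgamma(n + 0.5)
             + 0.5 * n * math.log1p(-rho * rho)
             + 0.5 * (n - 3) * math.log1p(-r * r)
             - (n - 0.5) * math.log1p(-rho * r))
    return math.exp(log_c) * hyp2f1(0.5, 0.5, n + 0.5, 0.5 * (1.0 + rho * r))

def power(rho: float, r0: float, N: int, steps: int = 2000) -> float:
    """Pr{r > r0 | rho}: midpoint-rule integral of the density over (r0, 1)."""
    h = (1.0 - r0) / steps
    return h * sum(density_hotelling(r0 + (k + 0.5) * h, rho, N)
                   for k in range(steps))

for rho in (-0.4, -0.2, 0.0, 0.2, 0.4, 0.6, 0.8):
    print(f"{rho:5.1f}  {power(rho, 0.5494, 10):.4f}")
```

At $\rho = 0$ the integral should return approximately the nominal level 0.05, and the other values can be compared with the entries of Table 4.1. The endpoints $\rho = \pm 1$ are degenerate and are not handled by this sketch.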

Thirdly, David’s computations lead to confidence regions for $\rho$. For given
$N$, $r_1'$ (defining a significance point) is a function of $\rho$, say $f_1(\rho)$, and $r_1$ is
another function of $\rho$, say $f_2(\rho)$, such that

(44) $\Pr\{f_1(\rho) < r < f_2(\rho) \mid \rho\} = 1 - \alpha.$

Clearly, $f_1(\rho)$ and $f_2(\rho)$ are monotonically increasing functions of $\rho$ if $r_1$
and $r_1'$ are chosen so $1 - F(r_1 \mid N, \rho) = \frac12\alpha = F(r_1' \mid N, \rho)$. If $\rho = f_i^{-1}(r)$ is the
inverse of $r = f_i(\rho)$, $i = 1, 2$, then the inequality $f_1(\rho) < r$ is equivalent to†
$\rho < f_1^{-1}(r)$, and $r < f_2(\rho)$ is equivalent to $f_2^{-1}(r) < \rho$. Thus (44) can be written

(45) $\Pr\{f_2^{-1}(r) < \rho < f_1^{-1}(r) \mid \rho\} = 1 - \alpha.$

This equation says that the probability is $1 - \alpha$ that we draw a sample such
that the interval $(f_2^{-1}(r), f_1^{-1}(r))$ covers the parameter $\rho$. Thus this interval is
a confidence interval for $\rho$ with confidence coefficient $1 - \alpha$. For a given $N$
and $\alpha$ the curves $r = f_1(\rho)$ and $r = f_2(\rho)$ appear as in Figure 4.3. In testing
the hypothesis $\rho = \rho_0$, the intersection of the line $\rho = \rho_0$ and the two curves
gives the significance points $r_1$ and $r_1'$. In setting up a confidence region for $\rho$
on the basis of a sample correlation $r^*$, we find the limits $f_2^{-1}(r^*)$ and
$f_1^{-1}(r^*)$ by the intersection of the line $r = r^*$ with the two curves. David gives
these curves for $\alpha = 0.1$, 0.05, 0.02, and 0.01 for various values of $N$. One-sided
confidence regions can be obtained by using only one inequality above.

Figure 4.3

†The point $(f_1(\rho), \rho)$ on the first curve is to the left of $(r, \rho)$, and the point $(r, f_1^{-1}(r))$ is above
$(r, \rho)$.

The tables of $F(r \mid N, \rho)$ can also be used instead of the curves for finding
the confidence interval. Given the sample value $r^*$, $f_1^{-1}(r^*)$ is the value of $\rho$
such that $\frac12\alpha = \Pr\{r \le r^* \mid \rho\} = F(r^* \mid N, \rho)$, and similarly $f_2^{-1}(r^*)$ is the value
of $\rho$ such that $\frac12\alpha = \Pr\{r \ge r^* \mid \rho\} = 1 - F(r^* \mid N, \rho)$. The interval between
these two values of $\rho$, $(f_2^{-1}(r^*), f_1^{-1}(r^*))$, is the confidence interval.

As an example, consider the confidence interval with confidence coefficient
0.95 based on the correlation of 0.7952 observed in a sample of 10.
Using Graph II of David, we find the two limits are 0.34 and 0.94. Hence we
state that $0.34 < \rho < 0.94$ with confidence 95%.

Definition 4.2.1. Let $L(x, \theta)$ be the likelihood function of the observation
vector $x$ and the parameter vector $\theta \in \Omega$. Let a null hypothesis be defined by a
proper subset $\omega$ of $\Omega$. The likelihood ratio criterion is

(46) $\lambda(x) = \dfrac{\sup_{\theta\in\omega} L(x, \theta)}{\sup_{\theta\in\Omega} L(x, \theta)}.$

The likelihood ratio test is the procedure of rejecting the null hypothesis when
$\lambda(x)$ is less than a predetermined constant.


Intuitively, one rejects the null hypothesis if the density of the observations
under the most favorable choice of parameters in the null hypothesis is
much less than the density under the most favorable unrestricted choice of
the parameters. Likelihood ratio tests have some desirable features; see
Lehmann (1959), for example. Wald (1943) has proved some favorable
asymptotic properties. For most hypotheses concerning the multivariate
normal distribution, likelihood ratio tests are appropriate and often are
optimal.

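Let us consider the likelihood ratio test of the hypothesis that $\rho = \rho_0$
based on a sample from the bivariate normal distribution. The set
$\Omega$ consists of $\mu_1$, $\mu_2$, $\sigma_1$, $\sigma_2$, and $\rho$ such that $\sigma_1 > 0$, $\sigma_2 > 0$, $-1 < \rho < 1$.
The set $\omega$ is the subset for which $\rho = \rho_0$. The likelihood maximized in $\Omega$ is
(by Lemmas 3.2.2 and 3.2.3)

(47) $\max_{\Omega} L = \dfrac{N^N e^{-N}}{(2\pi)^N(1-r^2)^{\frac12 N}\,a_{11}^{\frac12 N}\,a_{22}^{\frac12 N}}.$

Under the null hypothesis the likelihood function is

(48) $\dfrac{1}{(2\pi)^N(1-\rho_0^2)^{\frac12 N}(\sigma^2)^N}\exp\left[-\dfrac{a_{11}/\tau + \tau a_{22} - 2\rho_0 a_{12}}{2\sigma^2(1-\rho_0^2)}\right],$

where $\sigma^2 = \sigma_1\sigma_2$ and $\tau = \sigma_1/\sigma_2$. The maximum of (48) with respect to $\tau$
occurs at $\hat\tau = \sqrt{a_{11}}/\sqrt{a_{22}}$. The concentrated likelihood is

(49) $\dfrac{1}{(2\pi)^N(1-\rho_0^2)^{\frac12 N}(\sigma^2)^N}\exp\left[-\dfrac{\sqrt{a_{11}}\sqrt{a_{22}}\,(1-\rho_0 r)}{\sigma^2(1-\rho_0^2)}\right];$

the maximum of (49) occurs at

(50) $\hat\sigma^2 = \dfrac{\sqrt{a_{11}}\sqrt{a_{22}}\,(1-\rho_0 r)}{N(1-\rho_0^2)}.$

The likelihood ratio criterion is, therefore,

(51) $\dfrac{\max_{\omega} L}{\max_{\Omega} L} = \dfrac{(1-\rho_0^2)^{\frac12 N}(1-r^2)^{\frac12 N}}{(1-\rho_0 r)^N} = \left[\dfrac{(1-\rho_0^2)(1-r^2)}{(1-\rho_0 r)^2}\right]^{\frac12 N}.$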


The likelihood ratio test is $(1 - \rho_0^2)(1 - r^2)(1 - \rho_0 r)^{-2} < c$, where $c$ is chosen
so the probability of the inequality when samples are drawn from normal
populations with correlation $\rho_0$ is the prescribed significance level. The
critical region can be written equivalently as

(52) $(\rho_0^2 c - \rho_0^2 + 1)\,r^2 - 2\rho_0 c\,r + c - 1 + \rho_0^2 > 0,$

or

(53) $r > \dfrac{\rho_0 c + (1-\rho_0^2)\sqrt{1-c}}{\rho_0^2 c + 1 - \rho_0^2}, \qquad r < \dfrac{\rho_0 c - (1-\rho_0^2)\sqrt{1-c}}{\rho_0^2 c + 1 - \rho_0^2}.$
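
The equivalence of the criterion and the two limits in (53) can be checked numerically. This is a minimal sketch with assumed names (`lr_statistic`, `lr_critical_points`) and an arbitrary illustrative cutoff $c$; it does not choose $c$ to achieve a given significance level, which would require the distribution of $r$ under $\rho_0$.

```python
import math

def lr_statistic(r: float, rho0: float) -> float:
    """(1 - rho0^2)(1 - r^2)/(1 - rho0 r)^2, the criterion (51) raised to the power 2/N."""
    return (1.0 - rho0 ** 2) * (1.0 - r ** 2) / (1.0 - rho0 * r) ** 2

def lr_critical_points(rho0: float, c: float):
    """Lower and upper limits of (53) for a cutoff 0 < c <= 1."""
    denom = rho0 ** 2 * c + 1.0 - rho0 ** 2
    half_width = (1.0 - rho0 ** 2) * math.sqrt(1.0 - c)
    return (rho0 * c - half_width) / denom, (rho0 * c + half_width) / denom

# rho0 and c are chosen only for illustration.
lower, upper = lr_critical_points(0.5, 0.8)
print(lower, upper, lr_statistic(lower, 0.5), lr_statistic(upper, 0.5))
```

Both limits return the statistic to the cutoff $c$, and the statistic equals 1 at $r = \rho_0$, so values of $r$ outside the two limits are exactly those with statistic below $c$. The limits are not symmetric about $\rho_0$ unless $\rho_0 = 0$.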


Thus the likelihood ratio test of $H: \rho = \rho_0$ against alternatives $\rho \ne \rho_0$ has a
rejection region of the form $r > r_1$ and $r < r_1'$; but $r_1$ and $r_1'$ are not chosen so
that the probability of each inequality is $\alpha/2$ when $H$ is true, but are taken
to be of the form given in (53), where $c$ is chosen so that the probability of
the two inequalities is $\alpha$.


4.2.3. The Asymptotic Distribution of a Sample Correlation Coefficient
and Fisher’s z

In this section we shall show that as the sample size increases, a sample
correlation coefficient tends to be normally distributed. The distribution of a
particular function of a sample correlation, Fisher’s z [Fisher (1921)], which
has a variance approximately independent of the population correlation,
tends to normality faster.

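We are particularly interested in the sample correlation coefficient

(54) $r_{ij}(n) = \dfrac{A_{ij}(n)}{\sqrt{A_{ii}(n)A_{jj}(n)}}$

for some $i$ and $j$, $i \ne j$. This can also be written

(55) $r_{ij}(n) = \dfrac{C_{ij}(n)}{\sqrt{C_{ii}(n)C_{jj}(n)}},$

where $C_{gh}(n) = A_{gh}(n)/\sqrt{\sigma_{gg}\sigma_{hh}}$. The set $C_{ii}(n)$, $C_{jj}(n)$, and $C_{ij}(n)$ is
distributed like the distinct elements of the matrix

(56) $\sum_{\alpha=1}^{n}\begin{pmatrix}Z_{i\alpha}^*\\ Z_{j\alpha}^*\end{pmatrix}(Z_{i\alpha}^*, Z_{j\alpha}^*) = \sum_{\alpha=1}^{n}\begin{pmatrix}Z_{i\alpha}/\sqrt{\sigma_{ii}}\\ Z_{j\alpha}/\sqrt{\sigma_{jj}}\end{pmatrix}\left(Z_{i\alpha}/\sqrt{\sigma_{ii}},\ Z_{j\alpha}/\sqrt{\sigma_{jj}}\right),$

where the $(Z_{i\alpha}^*, Z_{j\alpha}^*)$ are independent, each with distribution

$N\left[\begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix}1 & \rho\\ \rho & 1\end{pmatrix}\right],$

where

$\rho = \dfrac{\sigma_{ij}}{\sqrt{\sigma_{ii}\sigma_{jj}}}.$

Let

(57) $U(n) = \dfrac{1}{n}\begin{pmatrix}C_{ii}(n)\\ C_{jj}(n)\\ C_{ij}(n)\end{pmatrix},$

(58) $b = \begin{pmatrix}1\\ 1\\ \rho\end{pmatrix}.$

Then by Theorem 3.4.4 the vector $\sqrt{n}\,[U(n) - b]$ has a limiting normal
distribution with mean 0 and covariance matrix

(59) $\begin{pmatrix}2 & 2\rho^2 & 2\rho\\ 2\rho^2 & 2 & 2\rho\\ 2\rho & 2\rho & 1+\rho^2\end{pmatrix}.$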

Now we need the general theorem: 

Theorem 4.2.3. Let $\{U(n)\}$ be a sequence of $m$-component random vectors
and $b$ a fixed vector such that $\sqrt{n}\,[U(n) - b]$ has the limiting distribution $N(0, T)$
as $n \to \infty$. Let $f(u)$ be a vector-valued function of $u$ such that each component
$f_j(u)$ has a nonzero differential at $u = b$, and let $\partial f_j(u)/\partial u_i|_{u=b}$ be the $i,j$th
component of $\Phi_b$. Then $\sqrt{n}\,\{f[U(n)] - f(b)\}$ has the limiting distribution
$N(0, \Phi_b' T \Phi_b)$.

Proof. See Serfling (1980), Section 3.3, or Rao (1973), Section 6a.2. A
function $g(u)$ is said to have a differential at $b$ or to be totally differentiable
at $b$ if the partial derivatives $\partial g(u)/\partial u_i$ exist at $u = b$ and for every $\varepsilon > 0$
there exists a neighborhood $N_\varepsilon(b)$ such that

(60) $\left|g(u) - g(b) - \sum_{i=1}^{m}\dfrac{\partial g}{\partial u_i}\bigg|_{u=b}(u_i - b_i)\right| \le \varepsilon\,\|u - b\|$ for all $u \in N_\varepsilon(b)$. ■

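It is clear that $U(n)$ defined by (57) with $b$ and $T$ defined by (58) and (59),
respectively, satisfies the conditions of the theorem. The function

(61) $r = f(u) = \dfrac{u_3}{\sqrt{u_1u_2}} = u_3u_1^{-\frac12}u_2^{-\frac12}$

satisfies the conditions; the elements of $\Phi_b$ are

(62) $\dfrac{\partial r}{\partial u_1}\bigg|_{u=b} = -\tfrac12 u_3u_1^{-\frac32}u_2^{-\frac12}\bigg|_{u=b} = -\tfrac12\rho, \quad \dfrac{\partial r}{\partial u_2}\bigg|_{u=b} = -\tfrac12 u_3u_1^{-\frac12}u_2^{-\frac32}\bigg|_{u=b} = -\tfrac12\rho, \quad \dfrac{\partial r}{\partial u_3}\bigg|_{u=b} = u_1^{-\frac12}u_2^{-\frac12}\bigg|_{u=b} = 1,$

and $f(b) = \rho$. The variance of the limiting distribution of $\sqrt{n}\,[r(n) - \rho]$ is

(63) $(-\tfrac12\rho, -\tfrac12\rho, 1)\begin{pmatrix}2 & 2\rho^2 & 2\rho\\ 2\rho^2 & 2 & 2\rho\\ 2\rho & 2\rho & 1+\rho^2\end{pmatrix}\begin{pmatrix}-\tfrac12\rho\\ -\tfrac12\rho\\ 1\end{pmatrix} = (\rho - \rho^3,\ \rho - \rho^3,\ 1 - \rho^2)\begin{pmatrix}-\tfrac12\rho\\ -\tfrac12\rho\\ 1\end{pmatrix}$

$= 1 - 2\rho^2 + \rho^4 = (1 - \rho^2)^2.$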

Thus we obtain the following: 


Theorem 4.2.4. If $r(n)$ is the sample correlation coefficient of a sample of $N$
$(= n + 1)$ from a normal distribution with correlation $\rho$, then $\sqrt{n}\,[r(n) - \rho]/(1 - \rho^2)$
[or $\sqrt{N}\,[r(n) - \rho]/(1 - \rho^2)$] has the limiting distribution $N(0, 1)$.
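
The quadratic form in (63) can be checked by direct arithmetic. The sketch below is illustrative only; the function name `limiting_variance` is an assumption. It multiplies out $\Phi_b' T \Phi_b$ with $T$ from (59) and the gradient $(-\tfrac12\rho, -\tfrac12\rho, 1)'$, and compares the result with $(1 - \rho^2)^2$.

```python
def limiting_variance(rho: float) -> float:
    """Quadratic form (63): phi' T phi with phi = (-rho/2, -rho/2, 1)."""
    T = [[2.0, 2.0 * rho ** 2, 2.0 * rho],
         [2.0 * rho ** 2, 2.0, 2.0 * rho],
         [2.0 * rho, 2.0 * rho, 1.0 + rho ** 2]]
    phi = [-0.5 * rho, -0.5 * rho, 1.0]
    T_phi = [sum(T[i][j] * phi[j] for j in range(3)) for i in range(3)]
    return sum(phi[i] * T_phi[i] for i in range(3))

for rho in (-0.7, 0.0, 0.3, 0.9):
    # Second column: quadratic form; third column: (1 - rho^2)^2.
    print(rho, limiting_variance(rho), (1.0 - rho ** 2) ** 2)
```

The variance shrinks toward zero as $|\rho|$ approaches 1, which is why the normal approximation for $r$ itself depends strongly on the unknown $\rho$.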


It is clear from Theorem 4.2.3 that if $f(x)$ is differentiable at $x = \rho$, then
$\sqrt{n}\,[f(r) - f(\rho)]$ is asymptotically normally distributed with mean zero and
variance

$\left(\dfrac{df}{dx}\bigg|_{x=\rho}\right)^2(1-\rho^2)^2.$

A useful function to consider is one whose asymptotic variance is constant
(here unity) independent of the parameter $\rho$. This function satisfies the
equation

(64) $f'(\rho) = \dfrac{1}{1-\rho^2} = \dfrac12\left(\dfrac{1}{1+\rho} + \dfrac{1}{1-\rho}\right).$

Thus $f(\rho)$ is taken as $\frac12[\log(1+\rho) - \log(1-\rho)] = \frac12\log[(1+\rho)/(1-\rho)]$. The
so-called Fisher's $z$ is

(65) $z = \dfrac12\log\dfrac{1+r}{1-r} = \tanh^{-1} r,$

where $r = \tanh z = (e^z - e^{-z})/(e^z + e^{-z})$. Let

(66) $\zeta = \dfrac12\log\dfrac{1+\rho}{1-\rho}.$

Theorem 4.2.5. Let $z$ be defined by (65), where $r$ is the correlation coefficient
of a sample of $N$ $(= n + 1)$ from a bivariate normal distribution with
correlation $\rho$; let $\zeta$ be defined by (66). Then $\sqrt{n}\,(z - \zeta)$ has a limiting normal
distribution with mean 0 and variance 1.


It can be shown that to a closer approximation

(67) $\mathscr{E}z \cong \zeta + \dfrac{\rho}{2n},$

(68) $\mathscr{E}(z - \zeta)^2 \cong \dfrac{1}{n-2}.$

The latter follows from

(69) $\mathscr{E}(z - \zeta)^2 = \dfrac{1}{n} + \dfrac{8 - \rho^2}{4n^2} + \cdots$

and holds good for $\rho^2/n^2$ small. Hotelling (1953) gives moments of $z$ to order
$n^{-3}$. An important property of Fisher's $z$ is that the approach to normality is
much more rapid than for $r$. David (1938) makes some comparisons between
the tabulated probabilities and the probabilities computed by assuming $z$ is
normally distributed. She recommends that for $N > 25$ one take $z$ as normally
distributed with mean and variance given by (67) and (68). Konishi
(1978a, 1978b, 1979) has also studied $z$. [Ruben (1966) has suggested an
alternative approach, which is more complicated, but possibly more accurate.]
We shall now indicate how Theorem 4.2.5 can be used.


a. Suppose we wish to test the hypothesis $\rho = \rho_0$ on the basis of a sample
of $N$ against the alternatives $\rho \ne \rho_0$. We compute $r$ and then $z$ by (65). Let

(70) $\zeta_0 = \dfrac12\log\dfrac{1+\rho_0}{1-\rho_0}.$

Then a region of rejection at the 5% significance level is

(71) $\sqrt{N-3}\,|z - \zeta_0| > 1.96.$

A better region is

(72) $\sqrt{N-3}\,\left|z - \zeta_0 - \dfrac{\frac12\rho_0}{N-1}\right| > 1.96.$


b. Suppose we have a sample of $N_1$ from one population and a sample of
$N_2$ from a second population. How do we test the hypothesis that the two
correlation coefficients are equal, $\rho_1 = \rho_2$? From Theorem 4.2.5 we know
that if the null hypothesis is true then $z_1 - z_2$ [where $z_1$ and $z_2$ are defined
by (65) for the two sample correlation coefficients] is asymptotically normally
distributed with mean 0 and variance $1/(N_1 - 3) + 1/(N_2 - 3)$. As a critical
region of size 5%, we use

(73) $\dfrac{|z_1 - z_2|}{\sqrt{1/(N_1 - 3) + 1/(N_2 - 3)}} > 1.96.$


c. Under the conditions of paragraph b, assume that $\rho_1 = \rho_2 = \rho$. How
do we use the results of both samples to give a joint estimate of $\rho$? Since $z_1$
and $z_2$ have variances $1/(N_1 - 3)$ and $1/(N_2 - 3)$, respectively, we can
estimate $\zeta$ by

(74) $\dfrac{(N_1 - 3)z_1 + (N_2 - 3)z_2}{N_1 + N_2 - 6}$

and convert this to an estimate of $\rho$ by the inverse of (65).

d. Let $r$ be the sample correlation from $N$ observations. How do we
obtain a confidence interval for $\rho$? We know that approximately

(75) $\Pr\{-1.96 < \sqrt{N-3}\,(z - \zeta) < 1.96\} = 0.95.$

From this we deduce that $[-1.96/\sqrt{N-3} + z,\ 1.96/\sqrt{N-3} + z]$ is a confidence
interval for $\zeta$. From this we obtain an interval for $\rho$ using the fact
$\rho = \tanh\zeta = (e^{\zeta} - e^{-\zeta})/(e^{\zeta} + e^{-\zeta})$, which is a monotonic transformation.
Thus a 95% confidence interval is

(76) $\tanh(z - 1.96/\sqrt{N-3}) < \rho < \tanh(z + 1.96/\sqrt{N-3}).$
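
The four procedures a to d translate directly into a few lines of code. The following is a sketch under stated assumptions: the function names are invented here, the 5% level is hard-coded through the normal point 1.96, and the normal approximation of Theorem 4.2.5 is taken at face value.

```python
import math

Z_975 = 1.96   # two-sided 5% point of N(0, 1), as used in (71)-(76)

def fisher_z(r: float) -> float:
    """Fisher's z of (65)."""
    return math.atanh(r)

def reject_rho0(r: float, N: int, rho0: float, corrected: bool = False) -> bool:
    """Rejection rule (71), or (72) when corrected=True, at the 5% level."""
    shift = 0.5 * rho0 / (N - 1) if corrected else 0.0
    return math.sqrt(N - 3) * abs(fisher_z(r) - fisher_z(rho0) - shift) > Z_975

def reject_equal(r1: float, N1: int, r2: float, N2: int) -> bool:
    """Rejection rule (73) for the hypothesis rho_1 = rho_2."""
    se = math.sqrt(1.0 / (N1 - 3) + 1.0 / (N2 - 3))
    return abs(fisher_z(r1) - fisher_z(r2)) / se > Z_975

def pooled_rho(r1: float, N1: int, r2: float, N2: int) -> float:
    """Estimate (74) of zeta, mapped back to the correlation scale."""
    z = ((N1 - 3) * fisher_z(r1) + (N2 - 3) * fisher_z(r2)) / (N1 + N2 - 6)
    return math.tanh(z)

def confidence_interval(r: float, N: int):
    """Approximate 95% interval (76) for rho."""
    half = Z_975 / math.sqrt(N - 3)
    z = fisher_z(r)
    return math.tanh(z - half), math.tanh(z + half)

print(confidence_interval(0.7952, 10))
```

For the drug example ($r = 0.7952$, $N = 10$) the interval comes out at roughly 0.33 to 0.95, close to the limits 0.34 and 0.94 read from David's graph in Section 4.2.2, even though $N$ is well below the size at which the normal approximation is recommended.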


The bootstrap method has been developed to assess the variability of a 
sample quantity. See Efron (1982). We shall illustrate the method on the 
sample correlation coefficient, but it can be applied to other quantities 
studied in this book. 

Suppose $x_1, \ldots, x_N$ is a sample from some bivariate population not necessarily
normal. The approach of the bootstrap is to consider these $N$ vectors
as a finite population of size $N$; a random vector $X$ has the (discrete)
probability

(77) $\Pr\{X = x_\alpha\} = \dfrac{1}{N}, \qquad \alpha = 1, \ldots, N.$

A random sample of size $N$ drawn from this finite population has a probability
distribution, and the correlation coefficient calculated from such a sample
has a (discrete) probability distribution, say $p_N(r)$. The bootstrap proposes to
use this distribution in place of the unobtainable distribution of the correlation
coefficient of random samples from the parent population. However, $p_N(r)$ is
prohibitively expensive to compute exactly; instead it is estimated by the empirical
distribution of $r$ calculated from a large number of random samples from
(77). Diaconis and Efron (1983) have given an example of $N = 15$; they find
the empirical distribution closely resembles the actual distribution of $r$
(essentially obtainable in this special case). An advantage of this approach is
that it is not necessary to assume knowledge of the parent population;
a disadvantage is the massive computation.
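
The resampling scheme can be sketched in a few lines. The code below is illustrative only: the data are made up, the function names and the number of replicates are assumptions, and resamples in which one variable is constant (so that $r$ is undefined) are simply skipped.

```python
import math
import random

def correlation(pairs):
    """Sample correlation of a list of (x, y) pairs; None if a variance is zero."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    if sxx == 0.0 or syy == 0.0:
        return None
    return sxy / math.sqrt(sxx * syy)

def bootstrap_correlations(pairs, replicates=2000, seed=0):
    """Correlations of `replicates` samples of size N drawn with replacement, as in (77)."""
    rng = random.Random(seed)
    n = len(pairs)
    out = []
    while len(out) < replicates:
        resample = [pairs[rng.randrange(n)] for _ in range(n)]
        r = correlation(resample)
        if r is not None:          # skip degenerate resamples
            out.append(r)
    return out

# Made-up data, for illustration only.
data = [(1.0, 2.1), (2.0, 2.9), (3.0, 3.2), (4.0, 4.8), (5.0, 5.1),
        (6.0, 5.5), (7.0, 7.4), (8.0, 7.9), (9.0, 9.6), (10.0, 9.8)]
reps = sorted(bootstrap_correlations(data))
print("r =", round(correlation(data), 4))
print("bootstrap 2.5% and 97.5% points:", reps[49], reps[1949])
```

The sorted replicates form the empirical distribution that estimates $p_N(r)$; its percentiles or standard deviation can then be read off directly. Fixing the seed makes the run reproducible.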


4.3. PARTIAL CORRELATION COEFFICIENTS;
CONDITIONAL DISTRIBUTIONS


4.3.1. Estimation of Partial Correlation Coefficients

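Partial correlation coefficients in normal distributions are correlation coefficients
in conditional distributions. It was shown in Section 2.5 that if $X$ is
distributed according to $N(\mu, \Sigma)$, where

(1) $X = \begin{pmatrix}X^{(1)}\\ X^{(2)}\end{pmatrix}, \qquad \mu = \begin{pmatrix}\mu^{(1)}\\ \mu^{(2)}\end{pmatrix}, \qquad \Sigma = \begin{pmatrix}\Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{pmatrix},$

then the conditional distribution of $X^{(1)}$ given $X^{(2)} = x^{(2)}$ is
$N[\mu^{(1)} + \beta(x^{(2)} - \mu^{(2)}),\ \Sigma_{11\cdot2}]$, where

(2) $\beta = \Sigma_{12}\Sigma_{22}^{-1},$

(3) $\Sigma_{11\cdot2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$

The partial correlations of $X^{(1)}$ given $x^{(2)}$ are the correlations calculated in
the usual way from $\Sigma_{11\cdot2}$. In this section we are interested in statistical
problems concerning these correlation coefficients.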

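First we consider the problem of estimation on the basis of a sample of $N$
from $N(\mu, \Sigma)$. What are the maximum likelihood estimators of the partial
correlations of $X^{(1)}$ (of $q$ components), $\rho_{ij\cdot q+1,\ldots,p}$? We know that the
maximum likelihood estimator of $\Sigma$ is $(1/N)A$, where

(4) $A = \sum_{\alpha=1}^{N}(x_\alpha - \bar{x})(x_\alpha - \bar{x})' = \sum_{\alpha=1}^{N}\begin{pmatrix}x_\alpha^{(1)} - \bar{x}^{(1)}\\ x_\alpha^{(2)} - \bar{x}^{(2)}\end{pmatrix}\left((x_\alpha^{(1)} - \bar{x}^{(1)})',\ (x_\alpha^{(2)} - \bar{x}^{(2)})'\right) = \begin{pmatrix}A_{11} & A_{12}\\ A_{21} & A_{22}\end{pmatrix}$

and $\bar{x} = (1/N)\sum_{\alpha=1}^{N}x_\alpha = (\bar{x}^{(1)\prime}, \bar{x}^{(2)\prime})'$. The correspondence between $\Sigma$ and
$\Sigma_{11\cdot2}$, $\beta$, and $\Sigma_{22}$ is one-to-one by virtue of (2) and (3) and

(5) $\Sigma_{12} = \beta\Sigma_{22},$

(6) $\Sigma_{11} = \Sigma_{11\cdot2} + \beta\Sigma_{22}\beta'.$


We can now apply Corollary 3.2.1 to the effect that maximum likelihood
estimators of functions of parameters are those functions of the maximum
likelihood estimators of those parameters.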


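Theorem 4.3.1. Let $x_1, \ldots, x_N$ be a sample from $N(\mu, \Sigma)$, where $\mu$
and $\Sigma$ are partitioned as in (1). Define $A$ by (4) and
$(\bar{x}^{(1)\prime}, \bar{x}^{(2)\prime}) = (1/N)\sum_{\alpha=1}^{N}(x_\alpha^{(1)\prime}, x_\alpha^{(2)\prime})$. Then the maximum likelihood estimators of $\mu^{(1)}$,
$\mu^{(2)}$, $\beta$, $\Sigma_{11\cdot2}$, and $\Sigma_{22}$ are $\hat\mu^{(1)} = \bar{x}^{(1)}$, $\hat\mu^{(2)} = \bar{x}^{(2)}$,

(7) $\hat\beta = A_{12}A_{22}^{-1}, \qquad \hat\Sigma_{11\cdot2} = \dfrac{1}{N}(A_{11} - A_{12}A_{22}^{-1}A_{21}),$

and $\hat\Sigma_{22} = (1/N)A_{22}$, respectively.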


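In turn, Corollary 3.2.1 can be used to obtain the maximum likelihood
estimators of $\mu^{(1)}$, $\mu^{(2)}$, $\beta$, $\Sigma_{22}$, $\sigma_{ii\cdot q+1,\ldots,p}$, $i = 1, \ldots, q$, and $\rho_{ij\cdot q+1,\ldots,p}$,
$i, j = 1, \ldots, q$. It follows that the maximum likelihood estimators of the partial
correlation coefficients are

(8) $\hat\rho_{ij\cdot q+1,\ldots,p} = \dfrac{\hat\sigma_{ij\cdot q+1,\ldots,p}}{\sqrt{\hat\sigma_{ii\cdot q+1,\ldots,p}}\,\sqrt{\hat\sigma_{jj\cdot q+1,\ldots,p}}}, \qquad i, j = 1, \ldots, q,$

where $\hat\sigma_{ij\cdot q+1,\ldots,p}$ is the $i,j$th element of $\hat\Sigma_{11\cdot2}$.

Theorem 4.3.2. Let $x_1, \ldots, x_N$ be a sample of $N$ from $N(\mu, \Sigma)$. The
maximum likelihood estimators of $\rho_{ij\cdot q+1,\ldots,p}$, the partial correlations of the first
$q$ components conditional on the last $p - q$ components, are given by

(9) $\hat\rho_{ij\cdot q+1,\ldots,p} = \dfrac{a_{ij\cdot q+1,\ldots,p}}{\sqrt{a_{ii\cdot q+1,\ldots,p}}\,\sqrt{a_{jj\cdot q+1,\ldots,p}}}, \qquad i, j = 1, \ldots, q,$

where

(10) $(a_{ij\cdot q+1,\ldots,p}) = A_{11} - A_{12}A_{22}^{-1}A_{21} = A_{11\cdot2}.$

The estimator $\hat\rho_{ij\cdot q+1,\ldots,p}$, denoted by $r_{ij\cdot q+1,\ldots,p}$, is called the sample
partial correlation coefficient between $X_i$ and $X_j$ holding $X_{q+1}, \ldots, X_p$ fixed. It is
also called the sample partial correlation coefficient between $X_i$ and $X_j$
having taken account of $X_{q+1}, \ldots, X_p$. Note that the calculations can be done
in terms of $(r_{ij})$.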

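The matrix $A_{11\cdot2}$ can also be represented as

(11) $A_{11\cdot2} = \sum_{\alpha=1}^{N}\left[x_\alpha^{(1)} - \bar{x}^{(1)} - \hat\beta(x_\alpha^{(2)} - \bar{x}^{(2)})\right]\left[x_\alpha^{(1)} - \bar{x}^{(1)} - \hat\beta(x_\alpha^{(2)} - \bar{x}^{(2)})\right]' = A_{11} - \hat\beta A_{22}\hat\beta'.$

The vector $x_\alpha^{(1)} - \bar{x}^{(1)} - \hat\beta(x_\alpha^{(2)} - \bar{x}^{(2)})$ is the residual of $x_\alpha^{(1)}$ from its regression
on $x_\alpha^{(2)}$ and 1. The partial correlations are simple correlations between these
residuals. The definition can be used also when the distributions involved are
not normal.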
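
The statement that partial correlations are simple correlations between residuals can be checked in the smallest case, $q = 2$ and $p = 3$. The sketch below uses made-up observations and assumed function names. It computes the correlation of the residuals of $x_1$ and $x_2$ after regression on $x_3$ and a constant, and compares it with the familiar closed form in terms of the simple correlations.

```python
import math

def corr(u, v):
    """Simple sample correlation of two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def residuals(y, x):
    """Residuals of y from its least squares regression on x and a constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - my - slope * (a - mx) for a, b in zip(x, y)]

def partial_corr_formula(r12, r13, r23):
    """Partial correlation of variables 1 and 2 given 3, from simple correlations."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

# Made-up observations, for illustration only.
x1 = [2.0, 3.5, 1.0, 4.2, 5.1, 3.3, 6.0, 4.8]
x2 = [1.1, 2.0, 0.7, 3.9, 2.8, 2.5, 4.1, 3.0]
x3 = [0.5, 1.5, 0.2, 2.0, 2.9, 1.1, 3.5, 2.2]

via_residuals = corr(residuals(x1, x3), residuals(x2, x3))
via_formula = partial_corr_formula(corr(x1, x2), corr(x1, x3), corr(x2, x3))
print(via_residuals, via_formula)
```

The two numbers agree up to rounding error, which is the content of (11) for a single conditioning variable. With more conditioning variables the residuals come from a multiple regression, and a matrix routine would be needed in place of the scalar slope.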

Two geometric interpretations of the above theory can be given. In
$p$-dimensional space $x_1, \ldots, x_N$ represent $N$ points. The sample regression
function

(12) $x^{(1)} = \bar{x}^{(1)} + \hat\beta(x^{(2)} - \bar{x}^{(2)})$

is a $(p - q)$-dimensional hyperplane which is the intersection of $q$ $(p - 1)$-dimensional
hyperplanes,

(13) $x_i = \bar{x}_i + \sum_{j=q+1}^{p}\hat\beta_{ij}(x_j - \bar{x}_j), \qquad i = 1, \ldots, q,$

where $x_i$, $x_j$ are running variables. Here $\hat\beta_{ij}$ is an element of $\hat\beta = \hat\Sigma_{12}\hat\Sigma_{22}^{-1} = A_{12}A_{22}^{-1}$.
The $i$th row of $\hat\beta$ is $(\hat\beta_{i,q+1}, \ldots, \hat\beta_{ip})$. Each right-hand side of (13) is
the least squares regression function of $x_i$ on $x_{q+1}, \ldots, x_p$; that is, if we
project the points $x_1, \ldots, x_N$ on the coordinate hyperplane of $x_i, x_{q+1}, \ldots, x_p$,
then (13) is the regression plane. The point with coordinates

(14) $x_i = \bar{x}_i + \sum_{j=q+1}^{p}\hat\beta_{ij}(x_{j\alpha} - \bar{x}_j), \qquad i = 1, \ldots, q,$

$x_j = x_{j\alpha}, \qquad j = q + 1, \ldots, p,$

is on the hyperplane (13). The difference in the $i$th coordinate of $x_\alpha$ and the
point (14) is $y_{i\alpha} = x_{i\alpha} - [\bar{x}_i + \sum_{j=q+1}^{p}\hat\beta_{ij}(x_{j\alpha} - \bar{x}_j)]$ for $i = 1, \ldots, q$ and 0 for
the other coordinates. Let $y_\alpha = (y_{1\alpha}, \ldots, y_{q\alpha})'$. These points can be represented
as $N$ points in a $q$-dimensional space. Then $A_{11\cdot2} = \sum_{\alpha=1}^{N}y_\alpha y_\alpha'$.

Figure 4.4

We can also interpret the sample as $p$ points in $N$-space (Figure 4.4). Let
$u_j = (x_{j1}, \ldots, x_{jN})'$ be the $j$th point, and let $\varepsilon = (1, \ldots, 1)'$ be another point.
The point with coordinates $\bar{x}_i, \ldots, \bar{x}_i$ is $\bar{x}_i\varepsilon$. The projection of $u_i$ on the
hyperplane spanned by $u_{q+1}, \ldots, u_p, \varepsilon$ is

(15) $\hat{u}_i = \bar{x}_i\varepsilon + \sum_{j=q+1}^{p}\hat\beta_{ij}(u_j - \bar{x}_j\varepsilon);$

this is the point on the hyperplane that is at a minimum distance from $u_i$. Let
$u_i^*$ be the vector from $\hat{u}_i$ to $u_i$, that is, $u_i - \hat{u}_i$, or, equivalently, this vector
translated so that one endpoint is at the origin. The set of vectors $u_1^*, \ldots, u_q^*$
are the projections of $u_1, \ldots, u_q$ on the hyperplane orthogonal to
$u_{q+1}, \ldots, u_p, \varepsilon$. Then $u_i^{*\prime}u_i^* = a_{ii\cdot q+1,\ldots,p}$, the length squared of $u_i^*$ (i.e., the
square of the distance of $u_i$ from $\hat{u}_i$). Then $u_i^{*\prime}u_j^*/\sqrt{u_i^{*\prime}u_i^*\,u_j^{*\prime}u_j^*} = r_{ij\cdot q+1,\ldots,p}$
is the cosine of the angle between $u_i^*$ and $u_j^*$.

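As an example of the use of partial correlations we consider some data
[Hooker (1907)] on yield of hay ($X_1$) in hundredweights per acre, spring
rainfall ($X_2$) in inches, and accumulated temperature above 42°F in the
spring ($X_3$) for an English area over 20 years. The estimates of $\mu_i$, $\sigma_i$
$(= \sqrt{\sigma_{ii}})$, and $\rho_{ij}$ are

(16) $\hat\mu = \bar{x} = \begin{pmatrix}28.02\\ 4.91\\ 594\end{pmatrix}, \qquad \begin{pmatrix}\hat\sigma_1\\ \hat\sigma_2\\ \hat\sigma_3\end{pmatrix} = \begin{pmatrix}4.42\\ 1.10\\ 85\end{pmatrix},$

$\begin{pmatrix}1 & \hat\rho_{12} & \hat\rho_{13}\\ \hat\rho_{21} & 1 & \hat\rho_{23}\\ \hat\rho_{31} & \hat\rho_{32} & 1\end{pmatrix} = \begin{pmatrix}1.00 & 0.80 & -0.40\\ 0.80 & 1.00 & -0.56\\ -0.40 & -0.56 & 1.00\end{pmatrix}.$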

From the correlations we observe that yield and rainfall are positively
related, yield and temperature are negatively related, and rainfall and temperature
are negatively related. What interpretation is to be given to the
apparent negative relation between yield and temperature? Does high temperature
tend to cause low yield, or is high temperature associated with low
rainfall and hence with low yield? To answer this question we consider the
correlation between yield and temperature when rainfall is held fixed; that is,
we use the data given above to estimate the partial correlation between $X_1$
and $X_3$ with $X_2$ held fixed. It is†

(17) $r_{13\cdot2} = \dfrac{r_{13} - r_{12}r_{23}}{\sqrt{(1 - r_{12}^2)(1 - r_{23}^2)}} = 0.097.$

Thus, if the effect of rainfall is removed, yield and temperature are positively
correlated. The conclusion is that both high rainfall and high temperature
increase hay yield, but in most years high rainfall occurs with low temperature
and vice versa.
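
The value in (17) follows from the three simple correlations in (16). The sketch below reproduces the arithmetic; the function name `partial_given_one` and the zero-based indexing are choices made here for illustration.

```python
import math

# Correlations of yield, rainfall, temperature, taken from (16).
R = [[1.00, 0.80, -0.40],
     [0.80, 1.00, -0.56],
     [-0.40, -0.56, 1.00]]

def partial_given_one(R, i, j, k):
    """Partial correlation of variables i and j with variable k held fixed."""
    num = R[i][j] - R[i][k] * R[k][j]
    return num / math.sqrt((1.0 - R[i][k] ** 2) * (1.0 - R[j][k] ** 2))

r13_2 = partial_given_one(R, 0, 2, 1)   # yield and temperature, rainfall fixed
r12_3 = partial_given_one(R, 0, 1, 2)   # yield and rainfall, temperature fixed
print(f"r_13.2 = {r13_2:.3f}")          # prints r_13.2 = 0.097
print(f"r_12.3 = {r12_3:.3f}")
```

The numerator is $-0.40 - (0.80)(-0.56) = 0.048$, so the sign of the yield and temperature relation flips once rainfall is held fixed, although the resulting partial correlation is small.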


†We compute with $\hat\Sigma$ as if it were $\Sigma$.


4.3.2. The Distribution of the Sample Partial Correlation Coefficient

In order to test a hypothesis about a population partial correlation coefficient
we want the distribution of the sample partial correlation coefficient. The
partial correlations are computed from $A_{11\cdot2} = A_{11} - A_{12}A_{22}^{-1}A_{21}$ (as indicated
in Theorem 4.3.1) in the same way that correlations are computed from $A$.
To obtain the distribution of a simple correlation we showed that $A$ was
distributed as $\sum_{\alpha=1}^{N-1}Z_\alpha Z_\alpha'$, where $Z_1, \ldots, Z_{N-1}$ are distributed independently
according to $N(0, \Sigma)$ and independent of $\bar{X}$ (Theorem 3.3.2). Here
we want to show that $A_{11\cdot2}$ is distributed as $\sum_{\alpha=1}^{N-1-(p-q)}U_\alpha U_\alpha'$, where
$U_1, \ldots, U_{N-1-(p-q)}$ are distributed independently according to $N(0, \Sigma_{11\cdot2})$
and independently of $\hat\beta$. The distribution of a partial correlation coefficient
will follow from the characterization of the distribution of $A_{11\cdot2}$. We state the
theorem in a general form; it will be used in Chapter 8, where we treat
regression in detail. The following corollary applies it to $A_{11\cdot2}$ expressed in
terms of residuals.

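Theorem 4.3.3. Suppose $Y_1, \ldots, Y_m$ are independent with $Y_\alpha$ distributed
according to $N(\Gamma w_\alpha, \Phi)$, where $w_\alpha$ is an $r$-component vector. Let $H = \sum_{\alpha=1}^{m}w_\alpha w_\alpha'$
assumed nonsingular, $G = \sum_{\alpha=1}^{m}Y_\alpha w_\alpha'H^{-1}$, and

(18) $C = \sum_{\alpha=1}^{m}(Y_\alpha - Gw_\alpha)(Y_\alpha - Gw_\alpha)' = \sum_{\alpha=1}^{m}Y_\alpha Y_\alpha' - GHG'.$

Then $C$ is distributed as $\sum_{\alpha=1}^{m-r}U_\alpha U_\alpha'$, where $U_1, \ldots, U_{m-r}$ are independently
distributed according to $N(0, \Phi)$ and independently of $G$.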

Proof. The rows of $Y = (Y_1, \ldots, Y_m)$ are random vectors in an $m$-dimensional
space, and the rows of $W = (w_1, \ldots, w_m)$ are fixed vectors in that space.
The idea of the proof is to rotate coordinate axes so that the last $r$ axes are
in the space spanned by the rows of $W$. Let $E_2 = FW$, where $F$ is a square
matrix such that $FHF' = I$. Then

(19) $E_2E_2' = FWW'F' = F\sum_{\alpha=1}^{m}w_\alpha w_\alpha'F' = FHF' = I.$

Thus the $m$-component rows of $E_2$ are orthogonal and of unit length. It is
possible to find an $(m - r) \times m$ matrix $E_1$ such that

(20) $E = \begin{pmatrix}E_1\\ E_2\end{pmatrix}$

is orthogonal. (See Appendix, Lemma A.4.2.) Now let $U = YE'$ (i.e., $U_\alpha = \sum_{\beta=1}^{m}e_{\alpha\beta}Y_\beta$).
By Theorem 3.3.1 the columns of $U = (U_1, \ldots, U_m)$ are independently
and normally distributed, each with covariance matrix $\Phi$. The means
are given by

(21) $\mathscr{E}U = \mathscr{E}YE' = \Gamma WE' = \Gamma F^{-1}E_2(E_1'\ \ E_2') = (0\ \ \Gamma F^{-1})$

by orthogonality of $E$. To complete the proof we need to show that $C$
transforms to $\sum_{\alpha=1}^{m-r}U_\alpha U_\alpha'$. We have

(22) $\sum_{\alpha=1}^{m}Y_\alpha Y_\alpha' = YY' = UEE'U' = UU' = \sum_{\alpha=1}^{m}U_\alpha U_\alpha'.$

Note that

(23) $G = YW'H^{-1} = UEE_2'(F^{-1})'F'F = U\begin{pmatrix}0\\ I\end{pmatrix}F = U^{(2)}F,$

where $U^{(2)} = (U_{m-r+1}, \ldots, U_m)$. Then

(24) $GHG' = U^{(2)}FHF'U^{(2)\prime} = U^{(2)}U^{(2)\prime} = \sum_{\alpha=m-r+1}^{m}U_\alpha U_\alpha'.$

Thus $C$ is

(25) $\sum_{\alpha=1}^{m}Y_\alpha Y_\alpha' - GHG' = \sum_{\alpha=1}^{m}U_\alpha U_\alpha' - \sum_{\alpha=m-r+1}^{m}U_\alpha U_\alpha' = \sum_{\alpha=1}^{m-r}U_\alpha U_\alpha'.$

This proves the theorem. ■

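It follows from the above considerations that when $\Gamma = 0$, $\mathscr{E}U = 0$, and
we obtain the following:

Corollary 4.3.1. If $\Gamma = 0$, the matrix $GHG'$ defined in Theorem 4.3.3 is
distributed as $\sum_{\alpha=m-r+1}^{m}U_\alpha U_\alpha'$, where $U_{m-r+1}, \ldots, U_m$ are independently
distributed, each according to $N(0, \Phi)$.

We now find the distribution of $A_{11\cdot2}$ in the same form. It was shown in
Theorem 3.3.1 that $A$ is distributed as $\sum_{\alpha=1}^{N-1}Z_\alpha Z_\alpha'$, where $Z_1, \ldots, Z_{N-1}$ are
independent, each with distribution $N(0, \Sigma)$. Let $Z_\alpha$ be partitioned into two
subvectors of $q$ and $p - q$ components, respectively:

(26) $Z_\alpha = \begin{pmatrix}Z_\alpha^{(1)}\\ Z_\alpha^{(2)}\end{pmatrix}.$

Then $A_{ij} = \sum_{\alpha=1}^{N-1}Z_\alpha^{(i)}Z_\alpha^{(j)\prime}$. By Lemma 4.2.1, conditionally on
$Z_1^{(2)} = z_1^{(2)}, \ldots, Z_{N-1}^{(2)} = z_{N-1}^{(2)}$, the random vectors $Z_1^{(1)}, \ldots, Z_{N-1}^{(1)}$ are independently
distributed, with $Z_\alpha^{(1)}$ distributed according to $N(\beta z_\alpha^{(2)}, \Sigma_{11\cdot2})$, where
$\beta = \Sigma_{12}\Sigma_{22}^{-1}$ and $\Sigma_{11\cdot2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$. Now we apply Theorem 4.3.3 with
$Z_\alpha^{(1)} = Y_\alpha$, $z_\alpha^{(2)} = w_\alpha$, $N - 1 = m$, $p - q = r$, $\beta = \Gamma$, $\Sigma_{11\cdot2} = \Phi$, $A_{11} = \sum_{\alpha=1}^{N-1}Y_\alpha Y_\alpha'$,
$A_{12}A_{22}^{-1} = G$, $A_{22} = H$. We find that the conditional distribution of
$A_{11} - (A_{12}A_{22}^{-1})A_{22}(A_{22}^{-1}A_{12}') = A_{11\cdot2}$ given $Z_\alpha^{(2)} = z_\alpha^{(2)}$, $\alpha = 1, \ldots, N - 1$, is that of
$\sum_{\alpha=1}^{N-1-(p-q)}U_\alpha U_\alpha'$, where $U_1, \ldots, U_{N-1-(p-q)}$ are independent, each with
distribution $N(0, \Sigma_{11\cdot2})$. Since this distribution does not depend on $\{z_\alpha^{(2)}\}$, we
obtain the following theorem: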

Theorem 4.3.4. The matrix $A_{11\cdot 2} = A_{11} - A_{12}A_{22}^{-1}A_{21}$ is distributed as
$\sum_{\alpha=1}^{N-1-(p-q)} U_\alpha U_\alpha'$, where $U_1, \dots, U_{N-1-(p-q)}$ are independently distributed, each
according to $N(0, \Sigma_{11\cdot 2})$, and independently of $A_{12}$ and $A_{22}$.

Corollary 4.3.2. If $\Sigma_{12} = 0$ ($\beta = 0$), then $A_{11\cdot 2}$ is distributed as
$\sum_{\alpha=1}^{N-1-(p-q)} U_\alpha U_\alpha'$ and $A_{12}A_{22}^{-1}A_{21}$ is distributed as $\sum_{\alpha=N-(p-q)}^{N-1} U_\alpha U_\alpha'$, where
$U_1, \dots, U_{N-1}$ are independently distributed, each according to $N(0, \Sigma_{11\cdot 2})$.

Now it follows that the distribution of $r_{ij\cdot q+1,\dots,p}$ based on $N$ observations
is the same as that of a simple correlation coefficient based on $N - (p - q)$
observations with a corresponding population correlation value of $\rho_{ij\cdot q+1,\dots,p}$.

Theorem 4.3.5. If the cdf of $r_{ij}$ based on a sample of $N$ from a normal
distribution with correlation $\rho_{ij}$ is denoted by $F(r \mid N, \rho_{ij})$, then the cdf of
the sample partial correlation $r_{ij\cdot q+1,\dots,p}$ based on a sample of $N$ from a
normal distribution with partial correlation coefficient $\rho_{ij\cdot q+1,\dots,p}$ is
$F[r \mid N - (p - q), \rho_{ij\cdot q+1,\dots,p}]$.

This distribution was derived by Fisher (1924).

4.3.3. Tests of Hypotheses and Confidence Regions for Partial
Correlation Coefficients

Since the distribution of a sample partial correlation $r_{ij\cdot q+1,\dots,p}$ based on a
sample of $N$ from a distribution with population correlation $\rho_{ij\cdot q+1,\dots,p}$
equal to a certain value, $\rho$, say, is the same as the distribution of a simple
correlation $r$ based on a sample of size $N - (p - q)$ from a distribution with
the corresponding population correlation of $\rho$, all statistical inference
procedures for the simple population correlation can be used for the partial
correlation. The procedure for the partial correlation is exactly the same
except that $N$ is replaced by $N - (p - q)$. To illustrate this rule we give two
examples.


Example 1. Suppose that on the basis of a sample of size $N$ we wish to
obtain a confidence interval for $\rho_{ij\cdot q+1,\dots,p}$. The sample partial correlation is
$r_{ij\cdot q+1,\dots,p}$. The procedure is to use David's charts for $N - (p - q)$. In the
example at the end of Section 4.3.1, we might want to find a confidence
interval for $\rho_{12\cdot 3}$ with confidence coefficient 0.95. The sample partial
correlation is $r_{12\cdot 3} = 0.759$. We use the chart (or table) for $N - (p - q) = 20 - 1 = 19$.
The interval is $0.50 < \rho_{12\cdot 3} < 0.88$.


Example 2. Suppose that on the basis of a sample of size $N$ we use
Fisher's $z$ for an approximate significance test of $\rho_{ij\cdot q+1,\dots,p} = \rho_0$ against
two-sided alternatives. We let

(27)    $z = \tfrac{1}{2} \log \dfrac{1 + r_{ij\cdot q+1,\dots,p}}{1 - r_{ij\cdot q+1,\dots,p}}, \qquad \zeta_0 = \tfrac{1}{2} \log \dfrac{1 + \rho_0}{1 - \rho_0}.$

Then $\sqrt{N - (p - q) - 3}\,(z - \zeta_0)$ is compared with the significance points of
the standardized normal distribution. In the example at the end of Section
4.3.1, we might wish to test the hypothesis $\rho_{13\cdot 2} = 0$ at the 0.05 level. Then
$\zeta_0 = 0$ and $\sqrt{20 - 1 - 3}\,(0.0973) = 0.3892$. This value is clearly nonsignificant
($|0.3892| < 1.96$), and hence the data do not indicate rejection of the null
hypothesis.
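
The arithmetic of Example 2 can be sketched in a few lines of Python. This is an illustration only: the function name `fisher_z_partial` and its argument names are assumptions made here, and the sample partial correlation is recovered from the reported value z = 0.0973 rather than taken from the data.

```python
import math

def fisher_z_partial(r, rho0, n_obs, n_fixed):
    """Approximate normal test statistic for H: partial correlation = rho0.

    r        sample partial correlation
    rho0     hypothesized population partial correlation
    n_obs    sample size N
    n_fixed  number of variables held fixed, p - q
    """
    z = math.atanh(r)          # (1/2) log[(1 + r) / (1 - r)]
    zeta0 = math.atanh(rho0)   # same transformation of the hypothesized value
    return math.sqrt(n_obs - n_fixed - 3) * (z - zeta0)

# Example 2 reports z = 0.0973 for r_{13.2}; recover r from z for illustration.
r_13_2 = math.tanh(0.0973)
stat = fisher_z_partial(r_13_2, 0.0, n_obs=20, n_fixed=1)
reject = abs(stat) > 1.96      # two-sided test at the 0.05 level
print(round(stat, 4), reject)
```

The only change from the simple-correlation test is that the sample size entering the square root is reduced by the number of fixed variables.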

To answer the question whether two variables $x_1$ and $x_2$ are related when
both may be related to a vector $x^{(2)} = (x_3, \dots, x_p)$, two approaches may be
used. One is to consider the regression of $x_1$ on $x_2$ and $x^{(2)}$ and test whether
the regression of $x_1$ on $x_2$ is 0. Another is to test whether $\rho_{12\cdot 3,\dots,p} = 0$.
Problems 4.43-4.47 show that these approaches lead to exactly the same test.


4.4. THE MULTIPLE CORRELATION COEFFICIENT

4.4.1. Estimation of the Multiple Correlation Coefficient

The population multiple correlation between one variate and a set of variates
was defined in Section 2.5. For the sake of convenience in this section we
shall treat the case of the multiple correlation between $X_1$ and the vector
$X^{(2)} = (X_2, \dots, X_p)'$; we shall not need subscripts on $R$. The variables can
always be numbered so that the desired multiple correlation is this one (any
irrelevant variables being omitted). Then the multiple correlation in the
population is

(1)    $\bar{R} = \sqrt{\dfrac{\beta'\Sigma_{22}\beta}{\sigma_{11}}} = \sqrt{\dfrac{\sigma_{(1)}'\Sigma_{22}^{-1}\sigma_{(1)}}{\sigma_{11}}},$

where $\beta$, $\sigma_{(1)}$, and $\Sigma_{22}$ are defined by

(2)    $\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{(1)}' \\ \sigma_{(1)} & \Sigma_{22} \end{pmatrix},$

(3)    $\beta = \Sigma_{22}^{-1}\sigma_{(1)}.$

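Given a sample $x_1, \dots, x_N$ ($N > p$), we estimate $\Sigma$ by $S = [N/(N - 1)]\hat\Sigma$ or

(4)    $\hat\Sigma = \dfrac{1}{N} A = \dfrac{1}{N} \sum_{\alpha=1}^{N} (x_\alpha - \bar x)(x_\alpha - \bar x)' = \begin{pmatrix} \hat\sigma_{11} & \hat\sigma_{(1)}' \\ \hat\sigma_{(1)} & \hat\Sigma_{22} \end{pmatrix},$

and we estimate $\beta$ by $\hat\beta = \hat\Sigma_{22}^{-1}\hat\sigma_{(1)} = A_{22}^{-1}a_{(1)}$. We define the sample multiple
correlation coefficient by

(5)    $R = \sqrt{\dfrac{\hat\beta'\hat\Sigma_{22}\hat\beta}{\hat\sigma_{11}}} = \sqrt{\dfrac{\hat\sigma_{(1)}'\hat\Sigma_{22}^{-1}\hat\sigma_{(1)}}{\hat\sigma_{11}}} = \sqrt{\dfrac{a_{(1)}'A_{22}^{-1}a_{(1)}}{a_{11}}}.$

That this is the maximum likelihood estimator of $\bar R$ is justified by Corollary
3.2.1, since we can define $\bar R$, $\sigma_{(1)}$, $\Sigma_{22}$ as a one-to-one transformation of $\Sigma$.
Another expression for $R$ [see (16) of Section 2.5] follows from

(6)    $1 - R^2 = \dfrac{|\hat\Sigma|}{\hat\sigma_{11}|\hat\Sigma_{22}|} = \dfrac{|A|}{a_{11}|A_{22}|}.$

The quantities $R$ and $\hat\beta$ have properties in the sample that are similar to
those $\bar R$ and $\beta$ have in the population. We have analogs of Theorems 2.5.2,
2.5.3, and 2.5.4. Let $\hat x_{1\alpha} = \bar x_1 + \hat\beta'(x_\alpha^{(2)} - \bar x^{(2)})$, and let $x_{1\alpha}^* = x_{1\alpha} - \hat x_{1\alpha}$ be the
residual.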


Theorem 4.4.1. The residuals $x_{1\alpha}^*$ are uncorrelated in the sample with the
components of $x_\alpha^{(2)}$, $\alpha = 1, \dots, N$. For every vector $a$

(7)    $\sum_{\alpha=1}^{N} [x_{1\alpha} - \bar x_1 - \hat\beta'(x_\alpha^{(2)} - \bar x^{(2)})]^2 \le \sum_{\alpha=1}^{N} [x_{1\alpha} - \bar x_1 - a'(x_\alpha^{(2)} - \bar x^{(2)})]^2.$

The sample correlation between $x_{1\alpha}$ and $a'x_\alpha^{(2)}$, $\alpha = 1, \dots, N$, is maximized for
$a = \hat\beta$, and that maximum correlation is $R$.

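Proof. Since the sample mean of the residuals is 0, the vector of sample
covariances between $x_{1\alpha}^*$ and $x_\alpha^{(2)}$ is proportional to

(8)    $\sum_{\alpha=1}^{N} [(x_{1\alpha} - \bar x_1) - \hat\beta'(x_\alpha^{(2)} - \bar x^{(2)})](x_\alpha^{(2)} - \bar x^{(2)})' = a_{(1)}' - \hat\beta'A_{22} = 0.$

The right-hand side of (7) can be written as the left-hand side plus

(9)    $\sum_{\alpha=1}^{N} [(\hat\beta - a)'(x_\alpha^{(2)} - \bar x^{(2)})]^2 = (\hat\beta - a)'A_{22}(\hat\beta - a),$

which is 0 if and only if $a = \hat\beta$. To prove the third assertion we consider the
vector $a$ for which $\sum_{\alpha=1}^{N} [a'(x_\alpha^{(2)} - \bar x^{(2)})]^2 = \sum_{\alpha=1}^{N} [\hat\beta'(x_\alpha^{(2)} - \bar x^{(2)})]^2$, since the
correlation is unchanged when the linear function is multiplied by a positive
constant. From (7) we obtain

(10)    $a_{11} - 2\sum_{\alpha=1}^{N} (x_{1\alpha} - \bar x_1)\hat\beta'(x_\alpha^{(2)} - \bar x^{(2)}) + \sum_{\alpha=1}^{N} [\hat\beta'(x_\alpha^{(2)} - \bar x^{(2)})]^2$
        $\le a_{11} - 2\sum_{\alpha=1}^{N} (x_{1\alpha} - \bar x_1)a'(x_\alpha^{(2)} - \bar x^{(2)}) + \sum_{\alpha=1}^{N} [a'(x_\alpha^{(2)} - \bar x^{(2)})]^2,$

from which we deduce

(11)    $\dfrac{\sum_{\alpha=1}^{N} (x_{1\alpha} - \bar x_1)a'(x_\alpha^{(2)} - \bar x^{(2)})}{\sqrt{a_{11}\sum_{\alpha=1}^{N} [a'(x_\alpha^{(2)} - \bar x^{(2)})]^2}} \le \dfrac{\sum_{\alpha=1}^{N} (x_{1\alpha} - \bar x_1)\hat\beta'(x_\alpha^{(2)} - \bar x^{(2)})}{\sqrt{a_{11}\sum_{\alpha=1}^{N} [\hat\beta'(x_\alpha^{(2)} - \bar x^{(2)})]^2}},$

the right-hand side of which is $R$ as defined in (5).

Thus $\bar x_1 + \hat\beta'(x_\alpha^{(2)} - \bar x^{(2)})$ is the best linear predictor of $x_{1\alpha}$ in the sample,
and $\hat\beta'x_\alpha^{(2)}$ is the linear function of $x_\alpha^{(2)}$ that has maximum sample correlation
with $x_{1\alpha}$. The minimum sum of squares of deviations [the left-hand side of
(7)] is

(12)    $\sum_{\alpha=1}^{N} [(x_{1\alpha} - \bar x_1) - \hat\beta'(x_\alpha^{(2)} - \bar x^{(2)})]^2 = a_{11} - \hat\beta'A_{22}\hat\beta = a_{11} - a_{(1)}'A_{22}^{-1}a_{(1)} = a_{11\cdot 2},$

as defined in Section 4.3 with $q = 1$. The maximum likelihood estimator of
$\sigma_{11\cdot 2}$ is $\hat\sigma_{11\cdot 2} = a_{11\cdot 2}/N$. It follows that

(13)    $\hat\sigma_{11\cdot 2} = (1 - R^2)\,\hat\sigma_{11}.$

Thus $1 - R^2$ measures the proportional reduction in the variance by using
residuals. We can say that $R^2$ is the fraction of the variance explained by $x^{(2)}$.
The larger $R^2$ is, the more the variance is decreased by use of the
explanatory variables in $x^{(2)}$.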
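
As a numerical illustration of Theorem 4.4.1 and of $R^2$ as the explained fraction of variance, the sketch below computes $\hat\beta = A_{22}^{-1}a_{(1)}$ for two predictors and checks that the sample correlation between $x_1$ and $\hat\beta'x^{(2)}$ equals $R$. The data and all variable names are hypothetical, chosen only for this example.

```python
import math

def centered(v):
    """Subtract the sample mean from each entry."""
    m = sum(v) / len(v)
    return [x - m for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical data: x1 is the first variate; x2 and x3 make up x^(2).
x1 = [2.0, 3.1, 4.9, 6.2, 6.8, 9.1]
x2 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x3 = [0.5, 0.1, 0.9, 0.3, 1.2, 0.7]

d1, d2, d3 = centered(x1), centered(x2), centered(x3)

a11 = dot(d1, d1)
a_1 = [dot(d1, d2), dot(d1, d3)]                 # a_(1)
A22 = [[dot(d2, d2), dot(d2, d3)],
       [dot(d3, d2), dot(d3, d3)]]

# beta_hat = A22^{-1} a_(1), solved explicitly for the 2 x 2 case.
det = A22[0][0] * A22[1][1] - A22[0][1] * A22[1][0]
beta = [(A22[1][1] * a_1[0] - A22[0][1] * a_1[1]) / det,
        (A22[0][0] * a_1[1] - A22[1][0] * a_1[0]) / det]

explained = dot(beta, a_1)                       # a_(1)' A22^{-1} a_(1)
a11_2 = a11 - explained                          # residual sum of squares
R2 = explained / a11                             # squared sample multiple correlation

# Sample correlation between x1 and the linear combination beta_hat' x^(2).
fitted = [beta[0] * u + beta[1] * v for u, v in zip(d2, d3)]
corr = dot(d1, fitted) / math.sqrt(a11 * dot(fitted, fitted))
resid = [a - b for a, b in zip(d1, fitted)]

print(R2, corr ** 2, a11_2 / a11)
```

The residual vector `resid` is orthogonal to both centered predictors, which is the sample statement that the residuals are uncorrelated with the components of $x^{(2)}$.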

In p-dimensional space $x_1, \dots, x_N$ represent $N$ points. The sample
regression function $x_1 = \bar x_1 + \hat\beta'(x^{(2)} - \bar x^{(2)})$ is the $(p - 1)$-dimensional hyperplane
that minimizes the squared deviations of the points from the hyperplane, the
deviations being calculated in the $x_1$-direction. The hyperplane goes through
the point $\bar x$.

In N-dimensional space the rows of $(x_1, \dots, x_N)$ represent $p$ points. The
N-component vector with $\alpha$th component $x_{i\alpha} - \bar x_i$ is the projection of the
vector with $\alpha$th component $x_{i\alpha}$ on the plane orthogonal to the equiangular
line. We have $p$ such vectors; $a'(x_\alpha^{(2)} - \bar x^{(2)})$ is the $\alpha$th component of a vector
in the hyperplane spanned by the last $p - 1$ vectors. Since the right-hand side
of (7) is the squared distance between the first vector and the linear
combination of the last $p - 1$ vectors, $\hat\beta'(x_\alpha^{(2)} - \bar x^{(2)})$ is a component of the
vector which minimizes this squared distance. The interpretation of (8) is that
the vector with $\alpha$th component $(x_{1\alpha} - \bar x_1) - \hat\beta'(x_\alpha^{(2)} - \bar x^{(2)})$ is orthogonal to
each of the last $p - 1$ vectors. Thus the vector with $\alpha$th component
$\hat\beta'(x_\alpha^{(2)} - \bar x^{(2)})$ is the projection of the first vector on the hyperplane. See Figure 4.5.
The length squared of the projection vector is

(14)    $\sum_{\alpha=1}^{N} [\hat\beta'(x_\alpha^{(2)} - \bar x^{(2)})]^2 = \hat\beta'A_{22}\hat\beta = a_{(1)}'A_{22}^{-1}a_{(1)},$

and the length squared of the first vector is $\sum_{\alpha=1}^{N} (x_{1\alpha} - \bar x_1)^2 = a_{11}$. Thus $R$ is
the cosine of the angle between the first vector and its projection.

In Section 3.2 we saw that the simple correlation coefficient is the cosine
of the angle between the two vectors involved (in the plane orthogonal to the
equiangular line). The property of $R$ that it is the maximum correlation
between $x_{1\alpha}$ and linear combinations of the components of $x_\alpha^{(2)}$ corresponds
to the geometric property that $R$ is the cosine of the smallest angle between
the vector with components $x_{1\alpha} - \bar x_1$ and a vector in the hyperplane spanned
by the other $p - 1$ vectors.

The geometric interpretations are in terms of the vectors in the $(N - 1)$-
dimensional hyperplane orthogonal to the equiangular line. It was shown in
Section 3.3 that the vector $(x_{i1} - \bar x_i, \dots, x_{iN} - \bar x_i)$ in this hyperplane can be
designated as $(z_{i1}, \dots, z_{i,N-1})$, where the $z_{i\alpha}$ are the coordinates referred to
an $(N - 1)$-dimensional coordinate system in the hyperplane. It was shown
that the new coordinates are obtained from the old by the transformation
$z_{i\alpha} = \sum_{\beta=1}^{N} b_{\alpha\beta}x_{i\beta}$, $\alpha = 1, \dots, N$, where $B = (b_{\alpha\beta})$ is an orthogonal matrix
with last row $(1/\sqrt{N}, \dots, 1/\sqrt{N})$. Then

(15)    $a_{ij} = \sum_{\alpha=1}^{N} (x_{i\alpha} - \bar x_i)(x_{j\alpha} - \bar x_j) = \sum_{\alpha=1}^{N-1} z_{i\alpha}z_{j\alpha}.$

It will be convenient to refer to the multiple correlation defined in terms of
$z_{i\alpha}$ as the multiple correlation without subtracting the means.

The population multiple correlation $\bar R$ is essentially the only function of
the parameters $\mu$ and $\Sigma$ that is invariant under changes of location, changes
of scale of $X_1$, and nonsingular linear transformations of $X^{(2)}$, that is,
transformations $X_1^* = cX_1 + d$, $X^{(2)*} = CX^{(2)} + d$. Similarly, the sample
multiple correlation coefficient $R$ is essentially the only function of $\bar x$ and $\hat\Sigma$, the
sufficient set of statistics for $\mu$ and $\Sigma$, that is invariant under these
transformations. Just as the simple correlation $r$ is a measure of association between
two scalar variables in a sample, the multiple correlation $R$ is a measure of
association between a scalar variable and a vector variable in a sample.


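4.4.2. Distribution of the Sample Multiple Correlation Coefficient
When the Population Multiple Correlation Coefficient Is Zero

From (5) we have

(16)    $R^2 = \dfrac{a_{(1)}'A_{22}^{-1}a_{(1)}}{a_{11}};$

then

(17)    $1 - R^2 = 1 - \dfrac{a_{(1)}'A_{22}^{-1}a_{(1)}}{a_{11}} = \dfrac{a_{11} - a_{(1)}'A_{22}^{-1}a_{(1)}}{a_{11}} = \dfrac{a_{11\cdot 2}}{a_{11}},$

and

(18)    $\dfrac{R^2}{1 - R^2} = \dfrac{a_{(1)}'A_{22}^{-1}a_{(1)}}{a_{11\cdot 2}}.$

For $q = 1$, Corollary 4.3.2 states that when $\beta = 0$, that is, when $\bar R = 0$, $a_{11\cdot 2}$ is
distributed as $\sum_{\alpha=1}^{N-p} V_\alpha^2$ and $a_{(1)}'A_{22}^{-1}a_{(1)}$ is distributed as $\sum_{\alpha=N-p+1}^{N-1} V_\alpha^2$, where
$V_1, \dots, V_{N-1}$ are independent, each with distribution $N(0, \sigma_{11\cdot 2})$. Then
$a_{11\cdot 2}/\sigma_{11\cdot 2}$ and $a_{(1)}'A_{22}^{-1}a_{(1)}/\sigma_{11\cdot 2}$ are distributed independently as $\chi^2$-variables
with $N - p$ and $p - 1$ degrees of freedom, respectively. Thus

(19)    $\dfrac{R^2}{1 - R^2} \cdot \dfrac{N - p}{p - 1} = \dfrac{a_{(1)}'A_{22}^{-1}a_{(1)}/\sigma_{11\cdot 2}}{a_{11\cdot 2}/\sigma_{11\cdot 2}} \cdot \dfrac{N - p}{p - 1} = \dfrac{\chi^2_{p-1}}{\chi^2_{N-p}} \cdot \dfrac{N - p}{p - 1} = F_{p-1,N-p}$

has the F-distribution with $p - 1$ and $N - p$ degrees of freedom. The density
of $F$ is

(20)    $\dfrac{\Gamma[\frac{1}{2}(N - 1)]}{\Gamma[\frac{1}{2}(p - 1)]\,\Gamma[\frac{1}{2}(N - p)]} \left(\dfrac{p - 1}{N - p}\right)^{\frac{1}{2}(p-1)} f^{\frac{1}{2}(p-1)-1} \left(1 + \dfrac{p - 1}{N - p} f\right)^{-\frac{1}{2}(N-1)}.$

Thus the density of

(21)    $R = \sqrt{\dfrac{\frac{p-1}{N-p} F_{p-1,N-p}}{1 + \frac{p-1}{N-p} F_{p-1,N-p}}}$

is

(22)    $\dfrac{2\,\Gamma[\frac{1}{2}(N - 1)]}{\Gamma[\frac{1}{2}(p - 1)]\,\Gamma[\frac{1}{2}(N - p)]}\, R^{p-2} (1 - R^2)^{\frac{1}{2}(N-p)-1}, \qquad 0 \le R \le 1.$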

Theorem 4.4.2. Let $R$ be the sample multiple correlation coefficient
[defined by (5)] between $X_1$ and $X^{(2)\prime} = (X_2, \dots, X_p)$ based on a sample of $N$ from
$N(\mu, \Sigma)$. If $\bar R = 0$ [that is, if $(\sigma_{12}, \dots, \sigma_{1p})' = 0 = \beta$], then $[R^2/(1 - R^2)] \cdot
[(N - p)/(p - 1)]$ is distributed as $F$ with $p - 1$ and $N - p$ degrees of freedom.

It should be noticed that $p - 1$ is the number of components of $X^{(2)}$ and
that $N - p = N - (p - 1) - 1$. If the multiple correlation is between a
component $X_i$ and $q$ other components, the numbers are $q$ and $N - q - 1$.

It might be observed that $R^2/(1 - R^2)$ is the quantity that arises in
regression (or least squares) theory for testing the hypothesis that the
regression of $X_1$ on $X_2, \dots, X_p$ is zero.

If $\bar R \ne 0$, the distribution of $R$ is much more difficult to derive. This
distribution will be obtained in Section 4.4.3.

Now let us consider the statistical problem of testing the hypothesis
$H: \bar R = 0$ on the basis of a sample of $N$ from $N(\mu, \Sigma)$. [$\bar R$ is the population
multiple correlation between $X_1$ and $(X_2, \dots, X_p)$.] Since $\bar R \ge 0$, the
alternatives considered are $\bar R > 0$.

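Let us derive the likelihood ratio test of this hypothesis. The likelihood
function is

(23)    $L(\mu^*, \Sigma^*) = \dfrac{1}{(2\pi)^{\frac{1}{2}pN} |\Sigma^*|^{\frac{1}{2}N}} \exp\left[-\tfrac{1}{2} \sum_{\alpha=1}^{N} (x_\alpha - \mu^*)'\Sigma^{*-1}(x_\alpha - \mu^*)\right].$

The observations are given; $L$ is a function of the indeterminates $\mu^*$, $\Sigma^*$. Let
$\omega$ be the region in the parameter space $\Omega$ specified by the null hypothesis.
The likelihood ratio criterion is

(24)    $\lambda = \dfrac{\max_{\mu^*, \Sigma^* \in \omega} L(\mu^*, \Sigma^*)}{\max_{\mu^*, \Sigma^* \in \Omega} L(\mu^*, \Sigma^*)}.$

Here $\Omega$ is the space of $\mu^*$, $\Sigma^*$ positive definite, and $\omega$ is the region in this
space where $\bar R = 0$, that is, where $\sigma_{(1)}'\Sigma_{22}^{-1}\sigma_{(1)} = 0$.
Because $\Sigma_{22}^{-1}$ is positive definite, this condition is equivalent to $\sigma_{(1)} = 0$. The
maximum of $L(\mu^*, \Sigma^*)$ over $\Omega$ occurs at $\mu^* = \hat\mu = \bar x$ and $\Sigma^* = \hat\Sigma = (1/N)A
= (1/N)\sum_{\alpha=1}^{N} (x_\alpha - \bar x)(x_\alpha - \bar x)'$ and is

(25)    $\max_{\mu^*, \Sigma^* \in \Omega} L(\mu^*, \Sigma^*) = \dfrac{N^{\frac{1}{2}pN} e^{-\frac{1}{2}pN}}{(2\pi)^{\frac{1}{2}pN} |A|^{\frac{1}{2}N}}.$

In $\omega$ the likelihood function is

(26)    $L(\mu^*, \Sigma^* \mid \sigma_{(1)}^* = 0) = \dfrac{1}{(2\pi)^{\frac{1}{2}N} \sigma_{11}^{*\frac{1}{2}N}} \exp\left[-\tfrac{1}{2} \sum_{\alpha=1}^{N} \dfrac{(x_{1\alpha} - \mu_1^*)^2}{\sigma_{11}^*}\right]$
        $\cdot\; \dfrac{1}{(2\pi)^{\frac{1}{2}(p-1)N} |\Sigma_{22}^*|^{\frac{1}{2}N}} \exp\left[-\tfrac{1}{2} \sum_{\alpha=1}^{N} (x_\alpha^{(2)} - \mu^{(2)*})'\Sigma_{22}^{*-1}(x_\alpha^{(2)} - \mu^{(2)*})\right].$

The first factor is maximized at $\mu_1^* = \hat\mu_1 = \bar x_1$ and $\sigma_{11}^* = \hat\sigma_{11} = (1/N)a_{11}$, and
the second factor is maximized at $\mu^{(2)*} = \hat\mu^{(2)} = \bar x^{(2)}$ and $\Sigma_{22}^* = \hat\Sigma_{22} =
(1/N)A_{22}$. The value of the maximized function is

(27)    $\max_{\mu^*, \Sigma^* \in \omega} L(\mu^*, \Sigma^*) = \dfrac{N^{\frac{1}{2}N} e^{-\frac{1}{2}N}}{(2\pi)^{\frac{1}{2}N} a_{11}^{\frac{1}{2}N}} \cdot \dfrac{N^{\frac{1}{2}(p-1)N} e^{-\frac{1}{2}(p-1)N}}{(2\pi)^{\frac{1}{2}(p-1)N} |A_{22}|^{\frac{1}{2}N}}.$

Thus the likelihood ratio criterion is [see (6)]

(28)    $\lambda = \dfrac{|A|^{\frac{1}{2}N}}{a_{11}^{\frac{1}{2}N} |A_{22}|^{\frac{1}{2}N}} = (1 - R^2)^{\frac{1}{2}N}.$

The likelihood ratio test consists of the critical region $\lambda < \lambda_0$, where $\lambda_0$ is
chosen so the probability of this inequality when $\bar R = 0$ is the significance
level $\alpha$. An equivalent test is

(29)    $1 - \lambda^{2/N} = R^2 > 1 - \lambda_0^{2/N}.$

Since $[R^2/(1 - R^2)][(N - p)/(p - 1)]$ is a monotonic function of $R$, an
equivalent test involves this ratio being larger than a constant. When $\bar R = 0$,
this ratio has an $F_{p-1,N-p}$-distribution. Hence, the critical region is

(30)    $\dfrac{R^2}{1 - R^2} \cdot \dfrac{N - p}{p - 1} > F_{p-1,N-p}(\alpha),$

where $F_{p-1,N-p}(\alpha)$ is the (upper) significance point corresponding to the $\alpha$
significance level.

Theorem 4.4.3. Given a sample $x_1, \dots, x_N$ from $N(\mu, \Sigma)$, the likelihood
ratio test at significance level $\alpha$ for the hypothesis $\bar R = 0$, where $\bar R$ is the
population multiple correlation coefficient between $X_1$ and $(X_2, \dots, X_p)$, is given
by (30), where $R$ is the sample multiple correlation coefficient defined by (5).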
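
Because the F ratio is monotonic in $R$, a critical region stated for the ratio can be restated as a threshold on $R$ itself. The sketch below does this inversion; the function name `r_critical` is an assumption for illustration, and the F significance point is supplied by the user rather than computed.

```python
import math

def r_critical(f_crit, n_obs, p):
    """Threshold on R equivalent to the F ratio exceeding f_crit.

    f_crit  upper significance point of F with p - 1 and N - p degrees of freedom
    n_obs   sample size N
    p       total number of variates (one dependent, p - 1 explanatory)
    """
    ratio = (p - 1) * f_crit / (n_obs - p)    # critical value of R^2 / (1 - R^2)
    return math.sqrt(ratio / (1.0 + ratio))

# 6.11 is the tabled 0.01 point for 2 and 17 degrees of freedom
# quoted in the hay-yield example of this section (N = 20, p = 3).
r_crit = r_critical(6.11, n_obs=20, p=3)
print(round(r_crit, 3))
```

Any sample multiple correlation above this threshold falls in the critical region at the corresponding level.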


As an example consider the data given at the end of Section 4.3.1. The
sample multiple correlation coefficient is found from

(31)    $1 - R^2 = \dfrac{\begin{vmatrix} 1 & r_{12} & r_{13} \\ r_{21} & 1 & r_{23} \\ r_{31} & r_{32} & 1 \end{vmatrix}}{\begin{vmatrix} 1 & r_{23} \\ r_{32} & 1 \end{vmatrix}} = \dfrac{\begin{vmatrix} 1.00 & 0.80 & -0.40 \\ 0.80 & 1.00 & -0.56 \\ -0.40 & -0.56 & 1.00 \end{vmatrix}}{\begin{vmatrix} 1.00 & -0.56 \\ -0.56 & 1.00 \end{vmatrix}} = 0.357.$

Thus $R$ is 0.802. If we wish to test the hypothesis at the 0.01 level that hay
yield is independent of spring rainfall and temperature, we compare the
observed $[R^2/(1 - R^2)][(20 - 3)/(3 - 1)] = 15.3$ with $F_{2,17}(0.01) = 6.11$ and
find the result significant; that is, we reject the null hypothesis.
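
The numbers in this example can be checked with a short computation. The sketch below evaluates the ratio of determinants and the F ratio from the correlations quoted in the example (0.80, -0.40, -0.56); the helper names `det2` and `det3` are introduced here for illustration.

```python
def det2(m):
    """Determinant of a 2 x 2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def det3(m):
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

corr = [[1.00, 0.80, -0.40],
        [0.80, 1.00, -0.56],
        [-0.40, -0.56, 1.00]]
corr22 = [row[1:] for row in corr[1:]]          # correlations among variables 2 and 3

N, p = 20, 3
one_minus_R2 = det3(corr) / det2(corr22)        # ratio of determinants
R2 = 1.0 - one_minus_R2
F = (R2 / one_minus_R2) * (N - p) / (p - 1)     # F ratio with p - 1 and N - p d.f.
print(round(one_minus_R2, 3), round(R2 ** 0.5, 3), round(F, 1))
```

The correlation matrix can be used in place of $A$ because the ratio of determinants is unchanged when each variable is rescaled.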

The test of independence between $X_1$ and $(X_2, \dots, X_p) = X^{(2)\prime}$ is
equivalent to the test that if the regression of $X_1$ on $x^{(2)}$ (that is, the conditional
expected value of $X_1$ given $X_2 = x_2, \dots, X_p = x_p$) is $\mu_1 + \beta'(x^{(2)} - \mu^{(2)})$, the
vector of regression coefficients is 0. Here $\hat\beta = A_{22}^{-1}a_{(1)}$ is the usual least
squares estimate of $\beta$ with expected value $\beta$ and covariance matrix $\sigma_{11\cdot 2}A_{22}^{-1}$
(when the $x_\alpha^{(2)}$ are fixed), and $a_{11\cdot 2}/(N - p)$ is the usual estimate of $\sigma_{11\cdot 2}$.
Thus [see (18)]

(32)    $\dfrac{R^2}{1 - R^2} \cdot \dfrac{N - p}{p - 1} = \dfrac{\hat\beta'A_{22}\hat\beta}{a_{11\cdot 2}} \cdot \dfrac{N - p}{p - 1}$

is the usual F-statistic for testing the hypothesis that the regression of $X_1$ on
$x_2, \dots, x_p$ is 0. In this book we are primarily interested in the multiple
correlation coefficient as a measure of association between one variable and
a vector of variables when both are random. We shall not treat problems of
univariate regression. In Chapter 8 we study regression when the dependent
variable is a vector.


Adjusted Multiple Correlation Coefficient 

The expression (17) is the ratio of $a_{11\cdot 2}$, the sum of squared deviations from
the fitted regression, to $a_{11}$, the sum of squared deviations around the mean.
To obtain unbiased estimators of $\sigma_{11}$ when $\beta = 0$ we would divide these
quantities by their numbers of degrees of freedom, $N - p$ and $N - 1$,
respectively. Accordingly we can define an adjusted multiple correlation
coefficient $R^*$ by

(33)    $1 - R^{*2} = \dfrac{a_{11\cdot 2}/(N - p)}{a_{11}/(N - 1)} = \dfrac{N - 1}{N - p}(1 - R^2),$

which is equivalent to

(34)    $R^{*2} = R^2 - \dfrac{p - 1}{N - p}(1 - R^2).$

This quantity is smaller than $R^2$ (unless $p = 1$ or $R^2 = 1$). A possible merit to
it is that it takes account of $p$; the idea is that the larger $p$ is relative to $N$,
the greater the tendency of $R^2$ to be large by chance.
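
A brief numerical sketch of the adjustment, using the hay-yield figures from earlier in this section ($1 - R^2 = 0.357$, $N = 20$, $p = 3$). The function name `adjusted_r2` is an assumption; the second computation uses the algebraically equivalent form in which a penalty $(p - 1)(1 - R^2)/(N - p)$ is subtracted from $R^2$.

```python
def adjusted_r2(r2, n_obs, p):
    """Adjusted squared multiple correlation: 1 - [(N - 1)/(N - p)] (1 - R^2)."""
    return 1.0 - (n_obs - 1) / (n_obs - p) * (1.0 - r2)

r2 = 1.0 - 0.357                                  # hay-yield example, N = 20, p = 3
r2_adj = adjusted_r2(r2, n_obs=20, p=3)
r2_alt = r2 - (3 - 1) / (20 - 3) * (1.0 - r2)     # equivalent penalty form
print(round(r2_adj, 3), round(r2_alt, 3))
```

Note that the adjusted quantity can be negative when $R^2$ is small relative to $(p - 1)/(N - 1)$, so it is not itself the square of a real number in every sample.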


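4.4.3. Distribution of the Sample Multiple Correlation Coefficient When the
Population Multiple Correlation Coefficient Is Not Zero

In this subsection we shall find the distribution of $R$ when the null
hypothesis $\bar R = 0$ is not true. We shall find that the distribution depends only on the
population multiple correlation coefficient $\bar R$.

First let us consider the conditional distribution of $R^2/(1 - R^2) =
a_{(1)}'A_{22}^{-1}a_{(1)}/a_{11\cdot 2}$ given $Z_\alpha^{(2)} = z_\alpha^{(2)}$, $\alpha = 1, \dots, n$. Under these conditions
$Z_{11}, \dots, Z_{1n}$ are independently distributed, $Z_{1\alpha}$ according to $N(\beta'z_\alpha^{(2)}, \sigma_{11\cdot 2})$,
where $\beta = \Sigma_{22}^{-1}\sigma_{(1)}$ and $\sigma_{11\cdot 2} = \sigma_{11} - \sigma_{(1)}'\Sigma_{22}^{-1}\sigma_{(1)}$. The conditions are those
of Theorem 4.3.3 with $Y_\alpha = Z_{1\alpha}$, $\Gamma = \beta'$, $w_\alpha = z_\alpha^{(2)}$, $r = p - 1$, $\Phi = \sigma_{11\cdot 2}$,
$m = n$. Then $a_{11\cdot 2} = a_{11} - a_{(1)}'A_{22}^{-1}a_{(1)}$ corresponds to $\sum_\alpha Y_\alpha Y_\alpha' - GHG'$, and
$a_{11\cdot 2}/\sigma_{11\cdot 2}$ has a $\chi^2$-distribution with $n - (p - 1)$ degrees of freedom.
$a_{(1)}'A_{22}^{-1}a_{(1)} = (A_{22}^{-1}a_{(1)})'A_{22}(A_{22}^{-1}a_{(1)})$ corresponds to $GHG'$ and is distributed
as $\sum_\alpha U_\alpha^2$, $\alpha = n - (p - 1) + 1, \dots, n$, where $\mathrm{Var}(U_\alpha) = \sigma_{11\cdot 2}$ and

(35)    $\mathcal{E}(U_{n-p+2}, \dots, U_n) = \Gamma F^{-1},$

where $FHF' = I$ [$H = F^{-1}(F')^{-1}$]. Then $a_{(1)}'A_{22}^{-1}a_{(1)}/\sigma_{11\cdot 2}$ is distributed as
$\sum_\alpha (U_\alpha/\sqrt{\sigma_{11\cdot 2}})^2$, where $\mathrm{Var}(U_\alpha/\sqrt{\sigma_{11\cdot 2}}) = 1$ and

(36)    $\sum_{\alpha=n-p+2}^{n} \left(\dfrac{\mathcal{E}U_\alpha}{\sqrt{\sigma_{11\cdot 2}}}\right)^2 = \dfrac{\Gamma F^{-1}(\Gamma F^{-1})'}{\sigma_{11\cdot 2}} = \dfrac{\Gamma H \Gamma'}{\sigma_{11\cdot 2}} = \dfrac{\beta'A_{22}\beta}{\sigma_{11\cdot 2}}.$

Thus (conditionally) $a_{(1)}'A_{22}^{-1}a_{(1)}/\sigma_{11\cdot 2}$ has a noncentral $\chi^2$-distribution with
$p - 1$ degrees of freedom and noncentrality parameter $\beta'A_{22}\beta/\sigma_{11\cdot 2}$. (See
Theorem 5.4.1.) We are led to the following theorem:

Theorem 4.4.4. Let $R$ be the sample multiple correlation coefficient between
$X_1$ and $X^{(2)\prime} = (X_2, \dots, X_p)$ based on $N$ observations $(x_{11}, x_1^{(2)}), \dots, (x_{1N}, x_N^{(2)})$.
The conditional distribution of $[R^2/(1 - R^2)][(N - p)/(p - 1)]$ given $x_\alpha^{(2)}$ fixed
is noncentral $F$ with $p - 1$ and $N - p$ degrees of freedom and noncentrality
parameter $\beta'A_{22}\beta/\sigma_{11\cdot 2}$.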

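The conditional density (from Theorem 5.4.1) of $F = [R^2/(1 - R^2)][(N -
p)/(p - 1)]$ is

(37)    $\dfrac{(p - 1)\exp[-\frac{1}{2}\beta'A_{22}\beta/\sigma_{11\cdot 2}]}{(N - p)\,\Gamma[\frac{1}{2}(N - p)]} \sum_{\alpha=0}^{\infty} \dfrac{\left(\dfrac{\beta'A_{22}\beta}{2\sigma_{11\cdot 2}}\right)^{\alpha} \left[\dfrac{(p - 1)f}{N - p}\right]^{\frac{1}{2}(p-1)+\alpha-1} \Gamma[\frac{1}{2}(N - 1) + \alpha]}{\alpha!\,\Gamma[\frac{1}{2}(p - 1) + \alpha] \left[1 + \dfrac{(p - 1)f}{N - p}\right]^{\frac{1}{2}(N-1)+\alpha}},$

and the conditional density of $W = R^2$ is ($df = [(N - p)/(p - 1)](1 -
w)^{-2}\,dw$)

(38)    $\dfrac{\exp[-\frac{1}{2}\beta'A_{22}\beta/\sigma_{11\cdot 2}]}{\Gamma[\frac{1}{2}(N - p)]} (1 - w)^{\frac{1}{2}(N-p)-1} \sum_{\alpha=0}^{\infty} \left(\dfrac{\beta'A_{22}\beta}{2\sigma_{11\cdot 2}}\right)^{\alpha} \dfrac{w^{\frac{1}{2}(p-1)+\alpha-1}\,\Gamma[\frac{1}{2}(N - 1) + \alpha]}{\alpha!\,\Gamma[\frac{1}{2}(p - 1) + \alpha]}.$

To obtain the unconditional density we need to multiply (38) by the density
of $Z_1^{(2)}, \dots, Z_n^{(2)}$ to obtain the joint density of $W$ and $Z_1^{(2)}, \dots, Z_n^{(2)}$ and then
integrate with respect to the latter set to obtain the marginal density of $W$.
We have

(39)    $\dfrac{\beta'A_{22}\beta}{\sigma_{11\cdot 2}} = \dfrac{\beta' \sum_{\alpha=1}^{n} Z_\alpha^{(2)}Z_\alpha^{(2)\prime} \beta}{\sigma_{11\cdot 2}} = \sum_{\alpha=1}^{n} \left(\dfrac{\beta'Z_\alpha^{(2)}}{\sqrt{\sigma_{11\cdot 2}}}\right)^2.$

Since the distribution of $Z_\alpha^{(2)}$ is $N(0, \Sigma_{22})$, the distribution of $\beta'Z_\alpha^{(2)}/\sqrt{\sigma_{11\cdot 2}}$
is normal with mean zero and variance

(40)    $\mathcal{E}\left(\dfrac{\beta'Z_\alpha^{(2)}}{\sqrt{\sigma_{11\cdot 2}}}\right)^2 = \dfrac{\mathcal{E}\beta'Z_\alpha^{(2)}Z_\alpha^{(2)\prime}\beta}{\sigma_{11\cdot 2}} = \dfrac{\beta'\Sigma_{22}\beta}{\sigma_{11} - \beta'\Sigma_{22}\beta} = \dfrac{\beta'\Sigma_{22}\beta/\sigma_{11}}{1 - \beta'\Sigma_{22}\beta/\sigma_{11}} = \dfrac{\bar R^2}{1 - \bar R^2}.$

Thus $(\beta'A_{22}\beta/\sigma_{11\cdot 2})/[\bar R^2/(1 - \bar R^2)]$ has a $\chi^2$-distribution with $n$ degrees of
freedom. Let $\bar R^2/(1 - \bar R^2) = \phi$. Then $\beta'A_{22}\beta/\sigma_{11\cdot 2} = \phi\chi_n^2$. We compute

(41)    $\mathcal{E}\, e^{-\frac{1}{2}\phi\chi_n^2} \left(\dfrac{\phi\chi_n^2}{2}\right)^{\alpha} = \dfrac{\phi^{\alpha}}{2^{\alpha}} \int_0^{\infty} u^{\alpha} e^{-\frac{1}{2}\phi u}\, \dfrac{1}{2^{\frac{1}{2}n}\Gamma(\frac{1}{2}n)}\, u^{\frac{1}{2}n-1} e^{-\frac{1}{2}u}\, du$
        $= \dfrac{\phi^{\alpha}}{2^{\alpha}\,2^{\frac{1}{2}n}\,\Gamma(\frac{1}{2}n)} \int_0^{\infty} u^{\frac{1}{2}n+\alpha-1} e^{-\frac{1}{2}(1+\phi)u}\, du$
        $= \dfrac{\phi^{\alpha}\,\Gamma(\frac{1}{2}n + \alpha)}{(1 + \phi)^{\frac{1}{2}n+\alpha}\,\Gamma(\frac{1}{2}n)} \int_0^{\infty} \dfrac{1}{2^{\frac{1}{2}n+\alpha}\,\Gamma(\frac{1}{2}n + \alpha)}\, v^{\frac{1}{2}n+\alpha-1} e^{-\frac{1}{2}v}\, dv$
        $= \dfrac{\phi^{\alpha}\,\Gamma(\frac{1}{2}n + \alpha)}{(1 + \phi)^{\frac{1}{2}n+\alpha}\,\Gamma(\frac{1}{2}n)}.$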
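Applying this result to (38), we obtain as the density of $R^2$

(42)    $\dfrac{(1 - R^2)^{\frac{1}{2}(n-p-1)} (1 - \bar R^2)^{\frac{1}{2}n}}{\Gamma[\frac{1}{2}(n - p + 1)]\,\Gamma(\frac{1}{2}n)} \sum_{\mu=0}^{\infty} \dfrac{(\bar R^2)^{\mu} (R^2)^{\frac{1}{2}(p-1)+\mu-1}\, \Gamma^2(\frac{1}{2}n + \mu)}{\mu!\,\Gamma[\frac{1}{2}(p - 1) + \mu]}.$

Fisher (1928) found this distribution. It can also be written

(43)    $\dfrac{\Gamma(\frac{1}{2}n)}{\Gamma[\frac{1}{2}(p - 1)]\,\Gamma[\frac{1}{2}(n - p + 1)]} (R^2)^{\frac{1}{2}(p-1)-1} (1 - R^2)^{\frac{1}{2}(n-p+1)-1} (1 - \bar R^2)^{\frac{1}{2}n}$
        $\cdot\; F[\tfrac{1}{2}n, \tfrac{1}{2}n; \tfrac{1}{2}(p - 1); R^2\bar R^2],$

where $F$ is the hypergeometric function defined in (41) of Section 4.2.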




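Another form of the density can be obtained when $n - p + 1$ is even. We
have

(44)    $\sum_{\mu=0}^{\infty} \dfrac{(R^2\bar R^2)^{\mu}\, \Gamma^2(\frac{1}{2}n + \mu)}{\mu!\,\Gamma[\frac{1}{2}(p - 1) + \mu]} = \left(\dfrac{\partial}{\partial t}\right)^{\frac{1}{2}(n-p+1)} \sum_{\mu=0}^{\infty} \dfrac{t^{\frac{1}{2}n+\mu-1} (R^2\bar R^2)^{\mu}\, \Gamma(\frac{1}{2}n + \mu)}{\mu!} \Bigg|_{t=1}$
        $= \Gamma(\tfrac{1}{2}n) \left(\dfrac{\partial}{\partial t}\right)^{\frac{1}{2}(n-p+1)} t^{\frac{1}{2}n-1} (1 - tR^2\bar R^2)^{-\frac{1}{2}n} \Bigg|_{t=1}.$

The density is therefore

(45)    $\dfrac{(1 - \bar R^2)^{\frac{1}{2}n} (R^2)^{\frac{1}{2}(p-3)} (1 - R^2)^{\frac{1}{2}(n-p-1)}}{\Gamma[\frac{1}{2}(n - p + 1)]} \left(\dfrac{\partial}{\partial t}\right)^{\frac{1}{2}(n-p+1)} t^{\frac{1}{2}n-1} (1 - tR^2\bar R^2)^{-\frac{1}{2}n} \Bigg|_{t=1}.$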

Theorem 4.4.5. The density of the square of the multiple correlation
coefficient, $R^2$, between $X_1$ and $X_2, \dots, X_p$ based on a sample of $N = n + 1$ is given
by (42) or (43) [or (45) in the case of $n - p + 1$ even], where $\bar R^2$ is the
corresponding population multiple correlation coefficient.

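The moments of $R$ are

(46)    $\mathcal{E}R^h = \dfrac{(1 - \bar R^2)^{\frac{1}{2}n}}{\Gamma[\frac{1}{2}(n - p + 1)]\,\Gamma(\frac{1}{2}n)} \sum_{\mu=0}^{\infty} \dfrac{(\bar R^2)^{\mu}\, \Gamma^2(\frac{1}{2}n + \mu)}{\mu!\,\Gamma[\frac{1}{2}(p - 1) + \mu]}$
        $\cdot \int_0^1 (1 - R^2)^{\frac{1}{2}(n-p+1)-1} (R^2)^{\frac{1}{2}(p+h-1)+\mu-1}\, d(R^2)$
        $= \dfrac{(1 - \bar R^2)^{\frac{1}{2}n}}{\Gamma(\frac{1}{2}n)} \sum_{\mu=0}^{\infty} \dfrac{(\bar R^2)^{\mu}\, \Gamma^2(\frac{1}{2}n + \mu)\, \Gamma[\frac{1}{2}(p + h - 1) + \mu]}{\mu!\,\Gamma[\frac{1}{2}(p - 1) + \mu]\, \Gamma[\frac{1}{2}(n + h) + \mu]}.$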

The sample multiple correlation tends to overestimate the population
multiple correlation. The sample multiple correlation is the maximum sample
correlation between $x_1$ and linear combinations of $x^{(2)}$ and hence is greater
than the sample correlation between $x_1$ and $\beta'x^{(2)}$; however, the latter is the
simple sample correlation corresponding to the simple population correlation
between $x_1$ and $\beta'x^{(2)}$, which is $\bar R$, the population multiple correlation.

Suppose $R_1$ is the multiple correlation in the first of two samples and $\hat\beta_1$
is the estimate of $\beta$; then the simple correlation between $x_1$ and $\hat\beta_1'x^{(2)}$ in
the second sample will tend to be less than $R_1$ and in particular will be less
than $R_2$, the multiple correlation in the second sample. This has been called
"the shrinkage of the multiple correlation."

Kramer (1963) and Lee (1972) have given tables of the upper significance
points of $R$. Gajjar (1967), Gurland (1968), Gurland and Milton (1970),
Khatri (1966), and Lee (1971b) have suggested approximations to the
distributions of $R^2/(1 - R^2)$ and obtained large-sample results.


4.4.4. Some Optimal Properties of the Multiple Correlation Test

Theorem 4.4.6. Given the observations $x_1, \dots, x_N$ from $N(\mu, \Sigma)$, of all tests
of $\bar R = 0$ at a given significance level based on $\bar x$ and $A = \sum_{\alpha=1}^{N} (x_\alpha - \bar x)(x_\alpha - \bar x)'$
that are invariant with respect to transformations

(47)    $\bar x_1^* = c\bar x_1 + d, \qquad \bar x^{(2)*} = C\bar x^{(2)} + d,$
        $a_{11}^* = c^2 a_{11}, \qquad a_{(1)}^* = cCa_{(1)}, \qquad A_{22}^* = CA_{22}C',$

any critical rejection region given by $R$ greater than a constant is uniformly most
powerful.


Proof. The multiple correlation coefficient $R$ is invariant under the
transformation, and any function of the sufficient statistics that is invariant is a
function of $R$. (See Problem 4.34.) Therefore, any invariant test must be
based on $R$. The Neyman-Pearson fundamental lemma applied to testing
the null hypothesis $\bar R = 0$ against a specific alternative $\bar R = \bar R_0 > 0$ tells us the
most powerful test at a given level of significance is based on the ratio of the
density of $R$ for $\bar R = \bar R_0$, which is (42) times $2R$ [because (42) is the density of
$R^2$], to the density for $\bar R = 0$, which is (22). The ratio is a positive constant
times

(48)    $\sum_{\mu=0}^{\infty} \dfrac{(\bar R_0^2)^{\mu} (R^2)^{\mu}\, \Gamma^2(\frac{1}{2}n + \mu)}{\mu!\,\Gamma[\frac{1}{2}(p - 1) + \mu]}.$

Since (48) is an increasing function of $R$ for $R \ge 0$, the set of $R$ for which
(48) is greater than a constant is an interval of $R$ greater than a constant. ■

Theorem 4.4.7. On the basis of observations $x_1, \dots, x_N$ from $N(\mu, \Sigma)$, of
all tests of $\bar R = 0$ at a given significance level with power depending only on $\bar R$, the
test with critical region given by $R$ greater than a constant is uniformly most
powerful.

Theorem 4.4.7 follows from Theorem 4.4.6 in the same way that Theorem
5.6.4 follows from Theorem 5.6.1.

4.5. ELLIPTICALLY CONTOURED DISTRIBUTIONS

4.5.1. Observations Elliptically Contoured

Suppose $x_1, \dots, x_N$ are $N$ independent observations on a random p-vector $X$
with density

(1)    $|\Lambda|^{-\frac{1}{2}}\, g[(x - \nu)'\Lambda^{-1}(x - \nu)].$

The sample covariance matrix $S$ is an unbiased estimator of the covariance
matrix $\Sigma = [\mathcal{E}R^2/p]\Lambda$, where $R^2 = (X - \nu)'\Lambda^{-1}(X - \nu)$ and $\mathcal{E}R^2 < \infty$. An
estimator of $\rho_{ij} = \sigma_{ij}/\sqrt{\sigma_{ii}\sigma_{jj}} = \lambda_{ij}/\sqrt{\lambda_{ii}\lambda_{jj}}$ is $r_{ij} = s_{ij}/\sqrt{s_{ii}s_{jj}}$, $i, j =
1, \dots, p$. The small-sample distribution of $r_{ij}$ is in general difficult to obtain,
but the asymptotic distribution can be obtained from the limiting normal
distribution of $\sqrt{N}(S - \Sigma)$ given in (13) of Section 3.6.

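First we prove a general theorem on asymptotic distributions of functions
of the sample covariance matrix $S$ using Theorems 4.2.3 and 3.6.5. Define

(2)    $s = \mathrm{vec}\, S, \qquad \sigma = \mathrm{vec}\, \Sigma.$

Theorem 4.5.1. Let $f(s)$ be a vector-valued function such that each
component of $f(s)$ has a nonzero differential at $s = \sigma$. Suppose $S$ is the covariance of a
sample from (1) such that $\mathcal{E}R^4 < \infty$. Then

(3)    $\sqrt{N}[f(s) - f(\sigma)] = \dfrac{\partial f(\sigma)}{\partial \sigma'} \sqrt{N}(s - \sigma) + o_p(1)$
       $\xrightarrow{d} N\left\{0,\; \dfrac{\partial f(\sigma)}{\partial \sigma'} [2(1 + \kappa)(\Sigma \otimes \Sigma) + \kappa\sigma\sigma'] \left(\dfrac{\partial f(\sigma)}{\partial \sigma'}\right)'\right\}.$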

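Corollary 4.5.1. If

(4)    $f(cs) = f(s)$

for all $c > 0$ and all positive definite $S$ and the conditions of Theorem 4.5.1 hold,
then

(5)    $\sqrt{N}[f(s) - f(\sigma)] \xrightarrow{d} N\left\{0,\; 2(1 + \kappa) \dfrac{\partial f(\sigma)}{\partial \sigma'} (\Sigma \otimes \Sigma) \left(\dfrac{\partial f(\sigma)}{\partial \sigma'}\right)'\right\}.$

Proof. From (4) we deduce

(6)    $0 = \dfrac{\partial f(cs)}{\partial c} = \dfrac{\partial f(cs)}{\partial s'} \dfrac{\partial (cs)}{\partial c} = \dfrac{\partial f(cs)}{\partial s'}\, s.$

That is,

(7)    $\dfrac{\partial f(\sigma)}{\partial \sigma'}\, \sigma = 0.$    ■

The conclusion of Corollary 4.5.1 can be framed as

(8)    $\sqrt{\dfrac{N}{1 + \kappa}}\, [f(s) - f(\sigma)] \xrightarrow{d} N\left\{0,\; 2 \dfrac{\partial f(\sigma)}{\partial \sigma'} (\Sigma \otimes \Sigma) \left(\dfrac{\partial f(\sigma)}{\partial \sigma'}\right)'\right\}.$

The limiting normal distribution in (8) holds in particular when the sample is
drawn from the normal distribution. The corollary holds true if $\kappa$ is replaced
by a consistent estimator $\hat\kappa$. For example, a consistent estimator of $1 + \kappa$
given by (16) of Section 3.6 is

(9)    $1 + \hat\kappa = \sum_{\alpha=1}^{N} \dfrac{[(x_\alpha - \bar x)'S^{-1}(x_\alpha - \bar x)]^2}{N p(p + 2)}.$

A sample correlation such as $f(s) = r_{ij} = s_{ij}/\sqrt{s_{ii}s_{jj}}$ or a set of such
correlations is a function of $S$ that is invariant under scale transformations;
that is, it satisfies (4).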

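Corollary 4.5.2. Under the conditions of Theorem 4.5.1,

(10)    $\dfrac{\sqrt{N}\,(r_{ij} - \rho_{ij})}{\sqrt{1 + \kappa}\,(1 - \rho_{ij}^2)} \xrightarrow{d} N(0, 1).$

As in the case of the observations normally distributed,

(11)    $\sqrt{\dfrac{N}{1 + \kappa}} \left[\tfrac{1}{2} \log \dfrac{1 + r_{ij}}{1 - r_{ij}} - \tfrac{1}{2} \log \dfrac{1 + \rho_{ij}}{1 - \rho_{ij}}\right] \xrightarrow{d} N(0, 1).$

Of course, any improvement of (11) over (10) depends on the distribution
sampled.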
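
As a hedged illustration of how the kurtosis parameter enters these limits, the sketch below evaluates the two approximately standard normal statistics for a sample correlation: the direct form, $\sqrt{N/(1+\kappa)}\,(r - \rho)/(1 - \rho^2)$, and the Fisher $z$ form, $\sqrt{N/(1+\kappa)}\,(\tanh^{-1} r - \tanh^{-1}\rho)$. The function name `corr_z_stats` and the numerical inputs are hypothetical; in practice $\kappa$ would be replaced by a consistent estimate.

```python
import math

def corr_z_stats(r, rho, n_obs, kappa):
    """Two asymptotically N(0, 1) statistics for a sample correlation.

    r      sample correlation
    rho    hypothesized population correlation
    n_obs  sample size N
    kappa  kurtosis parameter of the elliptical family (0 for the normal)
    """
    scale = math.sqrt(n_obs / (1.0 + kappa))
    direct = scale * (r - rho) / (1.0 - rho ** 2)           # direct standardization
    fisher = scale * (math.atanh(r) - math.atanh(rho))      # Fisher z form
    return direct, fisher

# Hypothetical inputs: kappa = 0 is the normal case; kappa > 0 shrinks the statistic.
normal_case = corr_z_stats(0.45, 0.30, n_obs=100, kappa=0.0)
heavy_tails = corr_z_stats(0.45, 0.30, n_obs=100, kappa=0.5)
print(normal_case, heavy_tails)
```

With positive $\kappa$ both statistics are smaller in absolute value by the factor $\sqrt{1 + \kappa}$, so a departure that looks significant under normality may not be so under a heavier-tailed elliptical distribution.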

Partial correlations such as $r_{ij\cdot q+1,\dots,p}$, $i, j = 1, \dots, q$, are also invariant
functions of $S$.

Corollary 4.5.3. Under the conditions of Theorem 4.5.1,

(12)    $\sqrt{\dfrac{N}{1 + \kappa}}\; \dfrac{r_{ij\cdot q+1,\dots,p} - \rho_{ij\cdot q+1,\dots,p}}{1 - \rho_{ij\cdot q+1,\dots,p}^2} \xrightarrow{d} N(0, 1).$

Now let us consider the asymptotic distribution of $R^2$, the square of the
multiple correlation, when $\bar R^2$, the square of the population multiple
correlation, is 0. We use the notation of Section 4.4. $\bar R^2 = 0$ is equivalent to $\sigma_{(1)} = 0$.
Since the sample and population multiple correlation coefficients between
$X_1$ and $X^{(2)} = (X_2, \dots, X_p)'$ are invariant with respect to linear
transformations (47) of Section 4.4, for purposes of studying the distribution
of $R^2$ we can assume $\mu = 0$ and $\Sigma = I_p$. In that case $s_{11} \xrightarrow{p} 1$, $s_{(1)} \xrightarrow{p} 0$, and
$S_{22} \xrightarrow{p} I_{p-1}$. Furthermore, for $k, i \ne 1$ and $j = l = 1$, Lemma 3.6.1 gives

(13)    $\mathcal{E}\, s_{(1)}s_{(1)}' = \left(\dfrac{1}{n} + \dfrac{\kappa}{N}\right) I_{p-1}.$

Theorem 4.5.2. Under the conditions of Theorem 4.5.1

(14)    $\sqrt{\dfrac{N}{1 + \kappa}}\; s_{(1)} \xrightarrow{d} N(0, I_{p-1}).$

Corollary 4.5.4. Under the conditions of Theorem 4.5.1

(15)    $\dfrac{N R^2}{1 + \kappa} \xrightarrow{d} \chi^2_{p-1}.$


4.5.2. Elliptically Contoured Matrix Distributions

Now let us turn to the model

(16)    $|\Lambda|^{-N/2}\, g[\mathrm{tr}\,(X - \varepsilon_N\nu')\Lambda^{-1}(X - \varepsilon_N\nu')']$

based on the vector spherical model $g(\mathrm{tr}\, Y'Y)$. The unbiased estimators
of $\nu$ and $\Sigma = (\mathcal{E}R^2/p)\Lambda$ are $\bar x = (1/N)X'\varepsilon_N$ and $S = (1/n)A$, where $A =
(X - \varepsilon_N\bar x')'(X - \varepsilon_N\bar x')$.

Since

(17)    $(X - \varepsilon_N\nu')'(X - \varepsilon_N\nu') = A + N(\bar x - \nu)(\bar x - \nu)',$

$A$ and $\bar x$ are a complete set of sufficient statistics.

The maximum likelihood estimators of $\nu$ and $\Lambda$ are $\hat\nu = \bar x$ and $\hat\Lambda =
(p/w_g)A$. The maximum likelihood estimator of $\rho_{ij} = \lambda_{ij}/\sqrt{\lambda_{ii}\lambda_{jj}} = \sigma_{ij}/\sqrt{\sigma_{ii}\sigma_{jj}}$
is $\hat\rho_{ij} = a_{ij}/\sqrt{a_{ii}a_{jj}} = s_{ij}/\sqrt{s_{ii}s_{jj}}$ (Theorem 3.6.4).

The sample correlation $r_{ij}$ is a function $f(X)$ that satisfies the conditions
(45) and (46) of Theorem 3.6.5 and hence has the same distribution for an
arbitrary density $g[\mathrm{tr}(\cdot)]$ as for the normal density $g[\mathrm{tr}(\cdot)] = \mathrm{const}\; e^{-\frac{1}{2}\mathrm{tr}(\cdot)}$.
Similarly, a partial correlation $r_{ij\cdot q+1,\dots,p}$ and a multiple correlation $R^2$
satisfy the conditions, and the conclusion holds.

Theorem 4.5.3. When $X$ has the vector elliptical density (16), the
distributions of $r_{ij}$, $r_{ij\cdot q+1,\dots,p}$, and $R^2$ are the distributions derived for normally distributed
observations.

It follows from Theorem 4.5.3 that the asymptotic distributions of $r_{ij}$,
$r_{ij\cdot q+1,\dots,p}$, and $R^2$ are the same as for sampling from normal distributions.

The class of left spherical matrices $Y$ with densities is the class of $g(Y'Y)$.
Let $X = YC' + \varepsilon_N\nu'$, where $C'\Lambda^{-1}C = I$, that is, $\Lambda = CC'$. Then $X$ has the
density

(18)    $|C|^{-N}\, g[C^{-1}(X - \varepsilon_N\nu')'(X - \varepsilon_N\nu')(C')^{-1}].$

We now find a stochastic representation of the matrix $Y$.

Lemma 4.5.1. Let $V = (v_1, \dots, v_p)$, where $v_i$ is an N-component vector,
$i = 1, \dots, p$. Define recursively $w_1 = v_1$,

(19)    $w_i = v_i - \sum_{j=1}^{i-1} \dfrac{v_i'w_j}{w_j'w_j}\, w_j, \qquad i = 2, \dots, p.$

Let $u_i = w_i/\|w_i\|$. Then $\|u_i\| = 1$, $i = 1, \dots, p$, and $u_i'u_j = 0$, $i \ne j$. Further,

(20)    $V = UT',$

where $U = (u_1, \dots, u_p)$; $t_{ii} = \|w_i\|$, $i = 1, \dots, p$; $t_{ij} = v_i'w_j/\|w_j\| = v_i'u_j$, $j =
1, \dots, i - 1$, $i = 1, \dots, p$; and $t_{ij} = 0$, $i < j$.

The proof of the lemma is given in the first part of Section 7.2 and as the
Gram-Schmidt orthogonalization in the Appendix (Section A.5.1). This
lemma generalizes the construction in Section 3.2; see Figure 3.1. See also
Figure 7.1.
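
The recursion of Lemma 4.5.1 can be sketched directly. The code below is an illustrative implementation under the assumption that the columns are linearly independent; the function name `gram_schmidt` and the 4 x 3 example matrix are hypothetical. Each $w_i$ is obtained by removing from $v_i$ its components along the previously constructed $u_j$, which is the same as the projection on $w_j$ written in the lemma because $u_j = w_j/\|w_j\|$.

```python
import math

def gram_schmidt(cols):
    """Return (U, T) with V = U T', U'U = I, T lower triangular, t_ii > 0.

    cols is a list of p linearly independent column vectors v_1, ..., v_p.
    U is returned as a list of columns u_1, ..., u_p; T as a p x p list of rows.
    """
    p = len(cols)
    U, T = [], [[0.0] * p for _ in range(p)]
    for i, v in enumerate(cols):
        w = list(v)
        for j, u in enumerate(U):
            T[i][j] = sum(a * b for a, b in zip(v, u))      # t_ij = v_i' u_j
            w = [a - T[i][j] * b for a, b in zip(w, u)]     # remove component along u_j
        T[i][i] = math.sqrt(sum(a * a for a in w))          # t_ii = ||w_i||
        U.append([a / T[i][i] for a in w])                  # u_i = w_i / ||w_i||
    return U, T

# Hypothetical example: N = 4, p = 3, columns listed one per row.
V = [[1.0, 1.0, 0.0, 2.0],
     [0.0, 1.0, 1.0, 1.0],
     [2.0, 0.0, 1.0, 0.0]]
U, T = gram_schmidt(V)

# Rebuild v_i = sum_j t_ij u_j to confirm V = U T'.
rebuilt = [[sum(T[i][j] * U[j][k] for j in range(3)) for k in range(4)]
           for i in range(3)]
print(T)
```

This classical form of the recursion mirrors the lemma; for ill-conditioned columns a modified Gram-Schmidt or a Householder-based factorization is usually preferred numerically.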

Note that $T$ is lower triangular, $U'U = I_p$, and $V'V = TT'$. The last
equation, $t_{ii} > 0$, $i = 1, \dots, p$, and $t_{ij} = 0$, $i < j$, can be solved uniquely for $T$.
Thus $T$ is a function of $V'V$ (and the restrictions).

Let $Y$ ($N \times p$) have the density $g(Y'Y)$, and let $O_N$ be an orthogonal
$N \times N$ matrix. Then $Y^* = O_N Y$ has the density $g(Y^{*\prime}Y^*)$. Hence $Y^* =
O_N Y \stackrel{d}{=} Y$. Let $Y^* = U^*T^{*\prime}$, where $t_{ii}^* > 0$, $i = 1, \dots, p$, and $t_{ij}^* = 0$, $i < j$. From
$Y^{*\prime}Y^* = Y'Y$ it follows that $T^*T^{*\prime} = TT'$ and hence $T^* = T$, $Y^* = U^*T'$, and
$U^* = O_N U \stackrel{d}{=} U$. Let the space of $U$ ($N \times p$) such that $U'U = I_p$ be denoted
$O(N \times p)$.

Definition 4.5.1. If $U$ ($N \times p$) satisfies $U'U = I_p$ and $O_N U \stackrel{d}{=} U$ for all
orthogonal $O_N$, then $U$ is uniformly distributed on $O(N \times p)$.

The space of $U$ satisfying $U'U = I_p$ is known as a Stiefel manifold. The
probability measure of Definition 4.5.1 is known as the Haar invariant
distribution. The property $O_N U \stackrel{d}{=} U$ for all orthogonal $O_N$ defines the
(normalized) measure uniquely [Halmos (1956)].

Theorem 4.5.4. If $Y$ ($N \times p$) has the density $g(Y'Y)$, then $U$ defined by
$Y = UT'$, $U'U = I_p$, $t_{ii} > 0$, $i = 1, \dots, p$, and $t_{ij} = 0$, $i < j$, is uniformly
distributed on $O(N \times p)$.

The proof of Corollary 7.2.1 shows that for arbitrary $g(\cdot)$ the density of
$T$ is

(21)    $\prod_{i=1}^{p} \left\{C[\tfrac{1}{2}(N + 1 - i)]\, t_{ii}^{N-i}\right\} g(\mathrm{tr}\, TT'),$

where $C(\cdot)$ is defined in (8) of Section 2.7.

The stochastic representation of $Y$ ($N \times p$) with density $g(Y'Y)$ is

(22)    $Y \stackrel{d}{=} UT',$

where $U$ ($N \times p$) is uniformly distributed on $O(N \times p)$ and $T$ is lower
triangular with positive diagonal elements and has density (21).

Theorem 4.5.5. Let $f(X)$ be a vector-valued function of $X$ ($N \times p$) such
that

(23)  $f(X + \varepsilon_N v') = f(X)$

for all $v$ and

(24)  $f(XG) = f(X)$

for all $G$ ($p \times p$). Then the distribution of $f(X)$ where $X$ has an arbitrary density
(18) is the same as the distribution of $f(X)$ where $X$ has the normal density (18).

Proof. From (23) we find that $f(X) = f(YC')$, and from (24) we find
$f(YC') = f(UT'C') = f(U)$, which is the same for arbitrary and normal
densities (18). ■

Corollary 4.5.5. Let $f(X)$ be a vector-valued function of $X$ ($N \times p$) with
the density (18), where $\nu = 0$. Suppose (24) holds for all $G$ ($p \times p$). Then the
distribution of $f(X)$ for an arbitrary density (18) is the same as the distribution of
$f(X)$ when $X$ has the normal density (18).





The condition (24) of Corollary 4.5.5 is that $f(X)$ is invariant with respect
to linear transformations $X \to XG$.

The density (18) can be written as

(25)  $|C|^{-N} g\{C^{-1}[A + N(\bar{x} - \nu)(\bar{x} - \nu)'](C')^{-1}\}$,

which shows that $A$ and $\bar{x}$ are a complete set of sufficient statistics for
$\Lambda = CC'$ and $\nu$.

PROBLEMS 

4.1. (Sec. 4.2.1) Sketch

$k_N(r) = \dfrac{\Gamma[\frac{1}{2}(N-1)]}{\Gamma[\frac{1}{2}(N-2)]\sqrt{\pi}}\,(1 - r^2)^{\frac{1}{2}(N-4)}$

for (a) $N = 3$, (b) $N = 4$, (c) $N = 5$, and (d) $N = 10$.

4.2. (Sec. 4.2.1) Using the data of Problem 3.1, test the hypothesis that $X_1$ and $X_2$
are independent against all alternatives of dependence at significance level 0.01.

4.3. (Sec. 4.2.1) Suppose a sample correlation of 0.65 is observed in a sample of 10.
Test the hypothesis of independence against the alternatives of positive
correlation at significance level 0.05.

4.4. (Sec. 4.2.2) Suppose a sample correlation of 0.65 is observed in a sample of 20.
Test the hypothesis that the population correlation is 0.4 against the alternatives
that the population correlation is greater than 0.4 at significance level 0.05.

4.5. (Sec. 4.2.1) Find the significance points for testing $\rho = 0$ at the 0.01 level with
$N = 15$ observations against alternatives (a) $\rho \ne 0$, (b) $\rho > 0$, and (c) $\rho < 0$.

4.6. (Sec. 4.2.2) Find significance points for testing $\rho = 0.6$ at the 0.01 level with
$N = 20$ observations against alternatives (a) $\rho \ne 0.6$, (b) $\rho > 0.6$, and (c) $\rho < 0.6$.


4.7. (Sec. 4.2.2) Tabulate the power function at $\rho = -1(0.2)1$ for the tests in
Problem 4.5. Sketch the graph of each power function.

4.8. (Sec. 4.2.2) Tabulate the power function at $\rho = -1(0.2)1$ for the tests in
Problem 4.6. Sketch the graph of each power function.

4.9. (Sec. 4.2.2) Using the data of Problem 3.1, find a (two-sided) confidence
interval for $\rho_{12}$ with confidence coefficient 0.99.

4.10. (Sec. 4.2.2) Suppose $N = 10$, $r = 0.795$. Find a one-sided confidence interval
for $\rho$ [of the form $(r_0, 1)$] with confidence coefficient 0.95.





4.11. (Sec. 4.2.3) Use Fisher's $z$ to test the hypothesis $\rho = 0.7$ against alternatives
$\rho \ne 0.7$ at the 0.05 level with $r = 0.5$ and $N = 50$.

4.12. (Sec. 4.2.3) Use Fisher's $z$ to test the hypothesis $\rho_1 = \rho_2$ against the
alternatives $\rho_1 \ne \rho_2$ at the 0.01 level with $r_1 = 0.5$, $N_1 = 40$, $r_2 = 0.6$, $N_2 = 40$.

4.13. (Sec. 4.2.3) Use Fisher's $z$ to estimate $\rho$ based on sample correlations of $-0.7$
($N = 30$) and of $-0.6$ ($N = 40$).

4.14. (Sec. 4.2.3) Use Fisher's $z$ to obtain a confidence interval for $\rho$ with
confidence 0.95 based on a sample correlation of 0.65 and a sample size of 25.

4.15. (Sec. 4.2.2) Prove that when $N = 2$ and $\rho = 0$, $\Pr\{r = 1\} = \Pr\{r = -1\} = \frac{1}{2}$.

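4.16. (Sec. 4.2) Let $k_N(r, \rho)$ be the density of the sample correlation coefficient $r$
for a given value of $\rho$ and $N$. Prove that $r$ has a monotone likelihood ratio; that
is, show that if $\rho_1 > \rho_2$, then $k_N(r, \rho_1)/k_N(r, \rho_2)$ is monotonically increasing in
$r$. [Hint: Using (40), prove that if

$F[\tfrac{1}{2}, \tfrac{1}{2}; n + \tfrac{1}{2}; \tfrac{1}{2}(1 + \rho r)] = \sum_{\alpha=0}^{\infty} c_\alpha (1 + \rho r)^{\alpha} = g(r, \rho)$

has a monotone ratio, then $k_N(r, \rho)$ does. Show

$\dfrac{\partial^2}{\partial\rho\,\partial r} \log g(r, \rho) = \dfrac{\sum_{\alpha,\beta=0}^{\infty} c_\alpha c_\beta [(\alpha - \beta)^2 r\rho + (\alpha + \beta)](1 + r\rho)^{\alpha+\beta-2}}{2[\sum_{\alpha=0}^{\infty} c_\alpha (1 + r\rho)^{\alpha}]^2}$;

if $(\partial^2/\partial\rho\,\partial r)\log g(r, \rho) > 0$, then $g(r, \rho)$ has a monotone ratio. Show the
numerator of the above expression is positive by showing that for each $\alpha$ the
sum on $\beta$ is positive; use the fact that $c_{\alpha+1} < \frac{1}{2}c_\alpha$.]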

4.17. (Sec. 4.2) Show that of all tests of $\rho_0$ against a specific $\rho_1$ ($> \rho_0$) based on $r$,
the procedures for which $r > c$ implies rejection are the best. [Hint: This follows
from Problem 4.16.]

4.18. (Sec. 4.2) Show that of all tests of $\rho = \rho_0$ against $\rho > \rho_0$ based on $r$, a
procedure for which $r > c$ implies rejection is uniformly most powerful.

4.19. (Sec. 4.2) Prove $r$ has a monotone likelihood ratio for $r > 0$, $\rho > 0$ by proving
$h(r) = k_N(r, \rho_1)/k_N(r, \rho_2)$ is monotonically increasing for $\rho_1 > \rho_2$. Here $h(r)$ is
a constant times $(\sum_{\alpha=0}^{\infty} c_\alpha \rho_1^{\alpha} r^{\alpha})/(\sum_{\alpha=0}^{\infty} c_\alpha \rho_2^{\alpha} r^{\alpha})$. In the numerator of $h'(r)$,
show that the coefficient of $r^{\beta}$ is positive.

4.20. (Sec. 4.2) Prove that if $\Sigma$ is diagonal, then the sets $r_{ij}$ and $a_{ii}$ are
independently distributed. [Hint: Use the facts that $r_{ij}$ is invariant under scale
transformations and that the density of the observations depends only on the $a_{ii}$.]
4.21. (Sec. 4.2.1) Prove that if $\rho = 0$,

$\mathcal{E}\,r^{2m} = \dfrac{\Gamma[\frac{1}{2}(N-1)]\,\Gamma(m + \frac{1}{2})}{\sqrt{\pi}\,\Gamma[\frac{1}{2}(N-1) + m]}$.

4.22. (Sec. 4.2.2) Prove $f_1(\rho)$ and $f_2(\rho)$ are monotonically increasing functions
of $\rho$.

4.23. (Sec. 4.2.2) Prove that the density of the sample correlation $r$ [given by
(38)] is

$\dfrac{n-1}{\pi}\,(1 - \rho^2)^{\frac{1}{2}n}(1 - r^2)^{\frac{1}{2}(n-3)} \displaystyle\int_0^1 \dfrac{x^{n-1}\,dx}{(1 - \rho r x)^{n}\sqrt{1 - x^2}}$.

[Hint: Expand $(1 - \rho r x)^{-n}$ in a power series, integrate, and use the duplication
formula for the gamma function.]

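4.24. (Sec. 4.2) Prove that (39) is the density of $r$. [Hint: From Problem 2.12 show

$\displaystyle\int_0^{\infty}\!\!\int_0^{\infty} e^{-\frac{1}{2}(y^2 - 2xyz + z^2)}\,dy\,dz = \dfrac{\cos^{-1}(-x)}{\sqrt{1 - x^2}}$.

Then argue

$\displaystyle\int_0^{\infty}\!\!\int_0^{\infty} (yz)^{n-1} e^{-\frac{1}{2}(y^2 - 2xyz + z^2)}\,dy\,dz = \dfrac{d^{n-1}}{dx^{n-1}}\,\dfrac{\cos^{-1}(-x)}{\sqrt{1 - x^2}}$.

Finally show that the integral of (31) with respect to $a_{11}$ ($= y^2$) and $a_{22}$ ($= z^2$) is
(39).]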

4.25. (Sec. 4.2) Prove that (40) is the density of $r$. [Hint: In (31) let $a_{11} = ue^{-v}$ and
$a_{22} = ue^{v}$; show that the density of $v$ ($0 \le v < \infty$) and $r$ ($-1 \le r \le 1$) is


^=J-(l-p 2 )^(l-pr)- n+ ^(l — 严 - 〜 +pr)v\ 


Use the expansion 


(i -y)' 


oc 


Z 




Show that the integral is (40).] 


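4.26. (Sec. 4.2) Prove for integer $h$

$\mathcal{E}\,r^{2h+1} = \dfrac{(1 - \rho^2)^{\frac{1}{2}n}}{\sqrt{\pi}\,\Gamma(\frac{1}{2}n)} \displaystyle\sum_{\beta=0}^{\infty} \dfrac{(2\rho)^{2\beta+1}}{(2\beta+1)!}\, \dfrac{\Gamma^2[\frac{1}{2}(n+1) + \beta]\,\Gamma(h + \beta + \frac{3}{2})}{\Gamma(\frac{1}{2}n + h + \beta + 1)}$,

$\mathcal{E}\,r^{2h} = \dfrac{(1 - \rho^2)^{\frac{1}{2}n}}{\sqrt{\pi}\,\Gamma(\frac{1}{2}n)} \displaystyle\sum_{\beta=0}^{\infty} \dfrac{(2\rho)^{2\beta}}{(2\beta)!}\, \dfrac{\Gamma^2(\frac{1}{2}n + \beta)\,\Gamma(h + \beta + \frac{1}{2})}{\Gamma(\frac{1}{2}n + h + \beta)}$.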

4.27. (Sec. 4.2) The $t$-distribution. Prove that if $X$ and $Y$ are independently
distributed, $X$ having the distribution $N(0, 1)$ and $Y$ having the $\chi^2$-distribution
with $m$ degrees of freedom, then $W = X/\sqrt{Y/m}$ has the density

$\dfrac{\Gamma[\frac{1}{2}(m+1)]}{\sqrt{m}\,\sqrt{\pi}\,\Gamma(\frac{1}{2}m)} \left(1 + \dfrac{t^2}{m}\right)^{-\frac{1}{2}(m+1)}$.

[Hint: In the joint density of $X$ and $Y$, let $x = t w^{\frac{1}{2}} m^{-\frac{1}{2}}$ and integrate out $w$.]

4.28. (Sec. 4.2) Prove 

Hin) 帥 … +1] - 

[Hint: Use Problem 4.26 and the duplication formula for the gamma function.]

4.29. (Sec. 4.2) Show that $\sqrt{n}(r_{ij} - \rho_{ij})$, $(i, j) = (1,2), (1,3), (2,3)$, have a joint
limiting distribution with variances $(1 - \rho_{ij}^2)^2$ and covariances of $r_{ij}$ and $r_{ik}$, $j \ne k$,
being $\frac{1}{2}(2\rho_{jk} - \rho_{ij}\rho_{ik})(1 - \rho_{ij}^2 - \rho_{ik}^2 - \rho_{jk}^2) + \rho_{jk}^3$.

4.30. (Sec. 4.3.2) Find a confidence interval for $\rho_{13\cdot 2}$ with confidence 0.95 based on
$r_{13\cdot 2} = 0.097$ and $N = 20$.

4.31. (Sec. 4.3.2) Use Fisher's $z$ to test the hypothesis $\rho_{12\cdot 34} = 0$ against alternatives
$\rho_{12\cdot 34} \ne 0$ at significance level 0.01 with $r_{12\cdot 34} = 0.14$ and $N = 40$.

4.32. (Sec. 4.3) Show that the inequality $r_{12\cdot 3}^2 \le 1$ is the same as the inequality
$|r_{ij}| \ge 0$, where $|r_{ij}|$ denotes the determinant of the $3 \times 3$ correlation matrix.

4.33. (Sec. 4.3) Invariance of the sample partial correlation coefficient. Prove that
$r_{12\cdot 3,\ldots,p}$ is invariant under the transformations $x_{i\alpha}^* = a_i x_{i\alpha} + b_i' x_\alpha^{(3)} + c_i$, $a_i > 0$,
$i = 1, 2$, $x_\alpha^{(3)*} = C x_\alpha^{(3)} + b$, $\alpha = 1, \ldots, N$, where $x_\alpha^{(3)} = (x_{3\alpha}, \ldots, x_{p\alpha})'$, and that
any function of $\bar{x}$ and $\hat{\Sigma}$ that is invariant under these transformations is a
function of $r_{12\cdot 3,\ldots,p}$.

4.34. (Sec. 4.4) Invariance of the sample multiple correlation coefficient. Prove that $R$
is a function of the sufficient statistics $\bar{x}$ and $S$ that is invariant under changes
of location and scale of $x_{1\alpha}$ and nonsingular linear transformations of $x_\alpha^{(2)}$ (that
is, $x_{1\alpha}^* = c x_{1\alpha} + d$, $x_\alpha^{(2)*} = C x_\alpha^{(2)} + d$, $\alpha = 1, \ldots, N$) and that every function of $\bar{x}$
and $S$ that is invariant is a function of $R$.






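4.35. (Sec. 4.4) Prove that conditional on $Z_{1\alpha} = z_{1\alpha}$, $\alpha = 1, \ldots, n$, $R^2/(1 - R^2)$ is
distributed like $T^2/(N^* - 1)$, where $T^2 = N^* \bar{x}' S^{-1} \bar{x}$ based on $N^* = n$ observations
on a vector $X$ with $p^* = p - 1$ components, with mean vector $(c/\sigma_{11})\sigma_{(1)}$
($nc^2 = \sum z_{1\alpha}^2$) and covariance matrix $\Sigma_{22\cdot 1} = \Sigma_{22} - (1/\sigma_{11})\sigma_{(1)}\sigma_{(1)}'$. [Hint: The
conditional distribution of $Z_\alpha^{(2)}$ given $Z_{1\alpha} = z_{1\alpha}$ is $N[(1/\sigma_{11})\sigma_{(1)} z_{1\alpha}, \Sigma_{22\cdot 1}]$.
There is an $n \times n$ orthogonal matrix $B$ which carries $(z_{11}, \ldots, z_{1n})$ into $(c, \ldots, c)$
and $(Z_{i1}, \ldots, Z_{in})$ into $(Y_{i1}, \ldots, Y_{in})$, $i = 2, \ldots, p$. Let the new $X_\alpha'$ be
$(Y_{2\alpha}, \ldots, Y_{p\alpha})$.]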

4.36. (Sec. 4.4) Prove that the noncentrality parameter in the distribution in
Problem 4.35 is $(a_{11}/\sigma_{11})\bar{R}^2/(1 - \bar{R}^2)$.

4.37. (Sec. 4.4) Find the distribution of $R^2/(1 - R^2)$ by multiplying the density of
Problem 4.35 by the density of $a_{11}$ and integrating with respect to $a_{11}$.

4.38. (Sec. 4.4) Show that the density of $r^2$ derived from (38) of Section 4.2 is
identical with (42) in Section 4.4 for $p = 2$. [Hint: Use the duplication formula
for the gamma function.]

4.39. (Sec. 4.4) Prove that (30) is the uniformly most powerful test of $\bar{R} = 0$ based
on $r$. [Hint: Use the Neyman-Pearson fundamental lemma.]

4.40. (Sec. 4.4) Prove that (47) is the unique unbiased estimator of $\bar{R}^2$ based on $R^2$.

4.41. The estimates of $\mu$ and $\Sigma$ in Problem 3.1 are

$\bar{x} = (185.72 \quad 151.12 \quad 183.84 \quad 149.24)'$,

$S = \begin{pmatrix} 95.2933 & 52.8683 & 69.6617 & 46.1117 \\ 52.8683 & 54.3600 & 51.3117 & 35.0533 \\ 69.6617 & 51.3117 & 100.8067 & 56.5400 \\ 46.1117 & 35.0533 & 56.5400 & 45.0233 \end{pmatrix}$.

(a) Find the estimates of the parameters of the conditional distribution of
$(x_3, x_4)$ given $(x_1, x_2)$; that is, find $S_{21}S_{11}^{-1}$ and $S_{22\cdot 1} = S_{22} - S_{21}S_{11}^{-1}S_{12}$.

(b) Find the partial correlation $r_{34\cdot 12}$.

(c) Use Fisher's $z$ to find a confidence interval for $\rho_{34\cdot 12}$ with confidence 0.95.

(d) Find the sample multiple correlation coefficients between $x_3$ and $(x_1, x_2)$
and between $x_4$ and $(x_1, x_2)$.

(e) Test the hypotheses that $x_3$ is independent of $(x_1, x_2)$ and $x_4$ is
independent of $(x_1, x_2)$ at significance levels 0.05.

4.42. Let the components of $X$ correspond to scores on tests in arithmetic speed
($X_1$), arithmetic power ($X_2$), memory for words ($X_3$), memory for meaningful
symbols ($X_4$), and memory for meaningless symbols ($X_5$). The observed
correlations in a sample of 140 are [Kelley (1928)]


$\begin{pmatrix} 1.0000 & 0.4248 & 0.0420 & 0.0215 & 0.0573 \\ 0.4248 & 1.0000 & 0.1487 & 0.2489 & 0.2843 \\ 0.0420 & 0.1487 & 1.0000 & 0.6693 & 0.4662 \\ 0.0215 & 0.2489 & 0.6693 & 1.0000 & 0.6915 \\ 0.0573 & 0.2843 & 0.4662 & 0.6915 & 1.0000 \end{pmatrix}$.


(a) Find the partial correlation between $X_4$ and $X_5$, holding $X_3$ fixed.

(b) Find the partial correlation between $X_1$ and $X_2$, holding $X_3$, $X_4$, and $X_5$
fixed.

(c) Find the multiple correlation between $X_1$ and the set $X_3$, $X_4$, and $X_5$.

(d) Test the hypothesis at the 1% significance level that arithmetic speed is
independent of the three memory scores.

4.43. (Sec. 4.3) Prove that if $\rho_{ij\cdot q+1,\ldots,p} = 0$, then
$\sqrt{N - 2 - (p - q)}\; r_{ij\cdot q+1,\ldots,p} / \sqrt{1 - r_{ij\cdot q+1,\ldots,p}^2}$
is distributed according to the $t$-distribution with $N - 2 - (p - q)$
degrees of freedom.

4.44. (Sec. 4.3) Let $X' = (X_1, X_2, X^{(2)\prime})$ have the distribution $N(\mu, \Sigma)$. The
conditional distribution of $X_1$ given $X_2 = x_2$ and $X^{(2)} = x^{(2)}$ is

$N[\mu_1 + \gamma_2(x_2 - \mu_2) + \gamma'(x^{(2)} - \mu^{(2)}),\ \sigma_{11\cdot 2,\ldots,p}]$,


where 



The estimators of $\gamma_2$ and $\gamma$ are defined by



Show $c_2 = a_{12\cdot 3,\ldots,p}/a_{22\cdot 3,\ldots,p}$. [Hint: Solve for $c$ in terms of $c_2$ and the $a$'s, and
substitute.]

4.45. (Sec. 4.3) In the notation of Problem 4.44, prove

2, …, _p 4 22 、 i) - ^ a， (2) A 22 la (2)) 

^ a H^3 r s., r p ^ , p* 
Hint: Use


a 22 
a (2) 

4.46. (Sec. 4.3) Prove that $1/a_{22\cdot 3,\ldots,p}$ is the element in the upper left-hand corner
of

$\begin{pmatrix} a_{22} & a_{(2)}' \\ a_{(2)} & A_{22} \end{pmatrix}^{-1}$.

4.47. (Sec. 4.3) Using the results in Problems 4.43-4.46, prove that the test for
$\rho_{12\cdot 3,\ldots,p} = 0$ is equivalent to the usual $t$-test for $\gamma_2 = 0$.

4.48. Missing observations. Let $X = (Y'\ Z')'$, where $Y$ has $p$ components and $Z$ has $q$
components, be distributed according to $N(\mu, \Sigma)$, where



a U 2 . p =a ii~ ( c 2 c， ) 











Let $M$ observations be made on $X$, and $N - M$ additional observations be made
on $Y$. Find the maximum likelihood estimates of $\mu$ and $\Sigma$. [Anderson (1957).]
[Hint: Express the likelihood function in terms of the marginal density of $Y$ and
the conditional density of $Z$ given $Y$.]

4.49. Suppose $X$ is distributed according to $N(0, \Sigma)$, where

$\Sigma = \begin{pmatrix} 1 & \rho & \rho^2 \\ \rho & 1 & \rho \\ \rho^2 & \rho & 1 \end{pmatrix}$.

Show that on the basis of one observation, $x' = (x_1, x_2, x_3)$, we can obtain a
confidence interval for $\rho$ (with confidence coefficient $1 - \alpha$) by using as
endpoints of the interval the solutions in $t$ of

$[x_2^2 + \chi_3^2(\alpha)]\,t^2 - 2(x_1x_2 + x_2x_3)\,t + x_1^2 + x_2^2 + x_3^2 - \chi_3^2(\alpha) = 0$,

where $\chi_3^2(\alpha)$ is the significance point of the $\chi^2$-distribution with three degrees
of freedom at significance level $\alpha$.




CHAPTER 5 


The Generalized $T^2$-Statistic


5.1. INTRODUCTION 


One of the most important groups of problems in univariate statistics relates
to the mean of a given distribution when the variance of the distribution is
unknown. On the basis of a sample one may wish to decide whether the
mean is equal to a number specified in advance, or one may wish to give an
interval within which the mean lies. The statistic usually used in univariate
statistics is the difference between the mean of the sample $\bar{x}$ and the
hypothetical population mean $\mu$ divided by the sample standard deviation $s$.
If the distribution sampled is $N(\mu, \sigma^2)$, then

(1)  $t = \sqrt{N}\,\dfrac{\bar{x} - \mu}{s}$

has the well-known $t$-distribution with $N - 1$ degrees of freedom, where $N$ is
the number of observations in the sample. On the basis of this fact, one can
set up a test of the hypothesis $\mu = \mu_0$, where $\mu_0$ is specified, or one can set
up a confidence interval for the unknown parameter $\mu$.

The multivariate analog of the square of $t$ given in (1) is

(2)  $T^2 = N(\bar{x} - \mu)'S^{-1}(\bar{x} - \mu)$,

where $\bar{x}$ is the mean vector of a sample of $N$, and $S$ is the sample covariance
matrix. It will be shown how this statistic can be used for testing hypotheses
about the mean vector $\mu$ of the population and for obtaining confidence
regions for the unknown $\mu$. The distribution of $T^2$ will be obtained when $\mu$
in (2) is the mean of the distribution sampled and when $\mu$ is different from
the population mean. Hotelling (1931) proposed the $T^2$-statistic for two
samples and derived the distribution when $\mu$ is the population mean.
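The statistic in (2) is straightforward to compute from a data matrix. The sketch below assumes NumPy; the function name `hotelling_t2` and the small data set are illustrative and do not come from the text. It solves a linear system rather than inverting $S$.

```python
import numpy as np

def hotelling_t2(X, mu0):
    """T^2 = N (xbar - mu0)' S^{-1} (xbar - mu0) for an N x p data matrix X."""
    X = np.asarray(X, dtype=float)
    N = X.shape[0]
    xbar = X.mean(axis=0)
    # Sample covariance with divisor N - 1; atleast_2d covers the case p = 1.
    S = np.atleast_2d(np.cov(X, rowvar=False))
    d = xbar - np.asarray(mu0, dtype=float)
    return float(N * d @ np.linalg.solve(S, d))

# Illustrative data only (not from the text).
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
t2_at_mean = hotelling_t2(X, X.mean(axis=0))   # zero when mu0 is the sample mean
```

With $p = 1$ the function returns the square of the univariate $t$ of (1); for the observations 1, 2, 3, 4, 5 and hypothetical mean 2 it gives $5 \cdot 1^2 / 2.5 = 2$.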

In Section 5.3 various uses of the $T^2$-statistic are presented, including
simultaneous confidence intervals for all linear combinations of the mean
vector. A James-Stein estimator is given when $\Sigma$ is unknown. The power
function of the $T^2$-test is treated in Section 5.4, and the multivariate
Behrens-Fisher problem in Section 5.5. In Section 5.6, optimum properties
of the $T^2$-test are considered, with regard to both invariance and
admissibility. Stein's criterion for admissibility in the general exponential family is
proved and applied. The last section is devoted to inference about the mean
in elliptically contoured distributions.

5.2. DERIVATION OF THE GENERALIZED $T^2$-STATISTIC
AND ITS DISTRIBUTION

5.2.1. Derivation of the $T^2$-Statistic As a Function of the Likelihood
Ratio Criterion

Although the $T^2$-statistic has many uses, we shall begin our discussion by
showing that the likelihood ratio test of the hypothesis $H$: $\mu = \mu_0$ on the
basis of a sample from $N(\mu, \Sigma)$ is based on the $T^2$-statistic given in (2) of
Section 5.1. Suppose we have $N$ observations $x_1, \ldots, x_N$ ($N > p$). The
likelihood function is

(1)  $L(\mu, \Sigma) = (2\pi)^{-\frac{1}{2}pN}|\Sigma|^{-\frac{1}{2}N} \exp\left[-\frac{1}{2}\sum_{\alpha=1}^{N}(x_\alpha - \mu)'\Sigma^{-1}(x_\alpha - \mu)\right]$.

The observations are given; $L$ is a function of the indeterminates $\mu$, $\Sigma$. (We
shall not distinguish in notation between the indeterminates and the
parameters.) The likelihood ratio criterion is

(2)  $\lambda = \dfrac{\max_{\Sigma} L(\mu_0, \Sigma)}{\max_{\mu, \Sigma} L(\mu, \Sigma)}$;

that is, the numerator is the maximum of the likelihood function for $\mu$, $\Sigma$ in
the parameter space restricted by the null hypothesis ($\mu = \mu_0$, $\Sigma$ positive
definite), and the denominator is the maximum over the entire parameter
space ($\Sigma$ positive definite). When the parameters are unrestricted, the
maximum occurs when $\mu$, $\Sigma$ are defined by the maximum likelihood estimators
(Section 3.2) of $\mu$ and $\Sigma$,

(3)  $\hat{\mu}_\Omega = \bar{x}$,

(4)  $\hat{\Sigma}_\Omega = \dfrac{1}{N}\sum_{\alpha=1}^{N}(x_\alpha - \bar{x})(x_\alpha - \bar{x})'$.

When $\mu = \mu_0$, the likelihood function is maximized at

(5)  $\hat{\Sigma}_\omega = \dfrac{1}{N}\sum_{\alpha=1}^{N}(x_\alpha - \mu_0)(x_\alpha - \mu_0)'$

by Lemma 3.2.2. Furthermore, by Lemma 3.2.2

(6)  $\max_{\mu, \Sigma} L(\mu, \Sigma) = \dfrac{1}{(2\pi)^{\frac{1}{2}pN}|\hat{\Sigma}_\Omega|^{\frac{1}{2}N}}\,e^{-\frac{1}{2}pN}$,

(7)  $\max_{\Sigma} L(\mu_0, \Sigma) = \dfrac{1}{(2\pi)^{\frac{1}{2}pN}|\hat{\Sigma}_\omega|^{\frac{1}{2}N}}\,e^{-\frac{1}{2}pN}$.

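Thus the likelihood ratio criterion is

(8)  $\lambda = \dfrac{|\hat{\Sigma}_\Omega|^{\frac{1}{2}N}}{|\hat{\Sigma}_\omega|^{\frac{1}{2}N}} = \dfrac{|\sum_\alpha (x_\alpha - \bar{x})(x_\alpha - \bar{x})'|^{\frac{1}{2}N}}{|\sum_\alpha (x_\alpha - \mu_0)(x_\alpha - \mu_0)'|^{\frac{1}{2}N}} = \dfrac{|A|^{\frac{1}{2}N}}{|A + N(\bar{x} - \mu_0)(\bar{x} - \mu_0)'|^{\frac{1}{2}N}}$,

where

(9)  $A = \sum_{\alpha=1}^{N}(x_\alpha - \bar{x})(x_\alpha - \bar{x})' = (N - 1)S$.

Application of Corollary A.3.1 of the Appendix shows

(10)  $\lambda^{2/N} = \dfrac{|A|}{|A + [\sqrt{N}(\bar{x} - \mu_0)][\sqrt{N}(\bar{x} - \mu_0)]'|} = \dfrac{1}{1 + N(\bar{x} - \mu_0)'A^{-1}(\bar{x} - \mu_0)} = \dfrac{1}{1 + T^2/(N - 1)}$,

where

(11)  $T^2 = N(\bar{x} - \mu_0)'S^{-1}(\bar{x} - \mu_0) = (N - 1)N(\bar{x} - \mu_0)'A^{-1}(\bar{x} - \mu_0)$.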
The likelihood ratio test is defined by the critical region (region of
rejection)

(12)  $\lambda \le \lambda_0$,

where $\lambda_0$ is chosen so that the probability of (12) when the null hypothesis is
true is equal to the significance level. If we take the $\frac{1}{2}N$th root of both sides
of (12) and invert, subtract 1, and multiply by $N - 1$, we obtain

(13)  $T^2 \ge T_0^2$,

where

(14)  $T_0^2 = (N - 1)(\lambda_0^{-2/N} - 1)$.

Theorem 5.2.1. The likelihood ratio test of the hypothesis $\mu = \mu_0$ for the
distribution $N(\mu, \Sigma)$ is given by (13), where $T^2$ is defined by (11), $\bar{x}$ is the mean
of a sample of $N$ from $N(\mu, \Sigma)$, $S$ is the covariance matrix of the sample, and $T_0^2$
is chosen so that the probability of (13) under the null hypothesis is equal to the
chosen significance level.
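The identity linking the determinant ratio to $T^2$ in (10) can be confirmed on simulated data. This is a minimal sketch assuming NumPy; the sample, the seed, and the null value `mu0` are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 12, 3
X = rng.standard_normal((N, p))          # illustrative sample
mu0 = np.array([0.3, -0.2, 0.1])         # hypothetical null value

xbar = X.mean(axis=0)
A = (X - xbar).T @ (X - xbar)            # A = (N - 1) S, equation (9)
A0 = (X - mu0).T @ (X - mu0)             # sum of (x - mu0)(x - mu0)'
d = xbar - mu0

lam_2_over_N = np.linalg.det(A) / np.linalg.det(A0)   # lambda^{2/N} from (8)
T2 = (N - 1) * N * d @ np.linalg.solve(A, d)          # equation (11)
```

Up to rounding error, `lam_2_over_N` agrees with `1 / (1 + T2 / (N - 1))`, so small values of $\lambda$ correspond to large values of $T^2$.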

The Student $t$-test has the property that when testing $\mu = 0$ it is invariant
with respect to scale transformations. If the scalar random variable $X$ is
distributed according to $N(\mu, \sigma^2)$, then $X^* = cX$ is distributed according to
$N(c\mu, c^2\sigma^2)$, which is in the same class of distributions, and the hypothesis
$\mathcal{E}X = 0$ is equivalent to $\mathcal{E}X^* = \mathcal{E}cX = 0$. If the observations $x_\alpha$ are
transformed similarly ($x_\alpha^* = cx_\alpha$), then, for $c > 0$, $t^*$ computed from $x_\alpha^*$ is the
same as $t$ computed from $x_\alpha$. Thus, whatever the unit of measurement, the
statistical result is the same.

The generalized $T^2$-test has a similar property. If the vector random
variable $X$ is distributed according to $N(\mu, \Sigma)$, then $X^* = CX$ (for $|C| \ne 0$) is
distributed according to $N(C\mu, C\Sigma C')$, which is in the same class of
distributions. The hypothesis $\mathcal{E}X = 0$ is equivalent to the hypothesis $\mathcal{E}X^* = \mathcal{E}CX = 0$.
If the observations $x_\alpha$ are transformed in the same way, $x_\alpha^* = Cx_\alpha$, then $T^{*2}$
computed on the basis of $x_\alpha^*$ is the same as $T^2$ computed on the basis of $x_\alpha$.
This follows from the facts that $\bar{x}^* = C\bar{x}$ and $A^* = CAC'$ and the following
lemma:

Lemma 5.2.1. For any $p \times p$ nonsingular matrices $C$ and $H$ and any
vector $k$,

(15)  $k'H^{-1}k = (Ck)'(CHC')^{-1}(Ck)$.

Proof. The right-hand side of (15) is

(16)  $(Ck)'(CHC')^{-1}(Ck) = k'C'(C')^{-1}H^{-1}C^{-1}Ck = k'H^{-1}k$. ■

We shall show in Section 5.6 that of all tests invariant with respect to such 
transformations, (13) is the uniformly most powerful. 
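The invariance of $T^2$ under $x_\alpha \to Cx_\alpha$ can be illustrated numerically. The sketch below assumes NumPy and uses an arbitrary nonsingular `C`; it also transforms a nonzero hypothetical mean `mu0` to `C @ mu0`, which is a slight extension of the case $\mu = 0$ discussed above but follows from the same lemma.

```python
import numpy as np

def t2(X, mu0):
    """One-sample T^2 with the sample covariance matrix (divisor N - 1)."""
    N = X.shape[0]
    d = X.mean(axis=0) - mu0
    S = np.cov(X, rowvar=False)
    return float(N * d @ np.linalg.solve(S, d))

rng = np.random.default_rng(2)
X = rng.standard_normal((15, 3))         # illustrative sample
mu0 = np.array([0.5, 0.0, -0.5])         # hypothetical null value
C = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 3.0],
              [1.0, 0.0, 1.0]])          # determinant 5, so C is nonsingular

t2_original = t2(X, mu0)
t2_transformed = t2(X @ C.T, C @ mu0)    # each row x_a becomes C x_a
```

The two values coincide up to rounding, which is the content of Lemma 5.2.1 applied with $k = \sqrt{N}(\bar{x} - \mu_0)$ and $H = S$.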

We can give a geometric interpretation of the $\frac{1}{2}N$th root of the likelihood
ratio criterion,

(17)  $\lambda^{2/N} = \dfrac{|\sum_{\alpha=1}^{N}(x_\alpha - \bar{x})(x_\alpha - \bar{x})'|}{|\sum_{\alpha=1}^{N}(x_\alpha - \mu_0)(x_\alpha - \mu_0)'|}$,

in terms of parallelotopes. (See Section 7.5.) In the $p$-dimensional
representation the numerator of $\lambda^{2/N}$ is the sum of squares of volumes of all
parallelotopes with principal edges $p$ vectors, each with one endpoint at $\bar{x}$
and the other at an $x_\alpha$. The denominator is the sum of squares of volumes of
all parallelotopes with principal edges $p$ vectors, each with one endpoint at
$\mu_0$ and the other at $x_\alpha$. If the sum of squared volumes involving vectors
emanating from $\bar{x}$, the "center" of the $x_\alpha$, is much less than that involving
vectors emanating from $\mu_0$, then we reject the hypothesis that $\mu_0$ is the
mean of the distribution.


There is also an interpretation in the $N$-dimensional representation. Let
$y_i = (x_{i1}, \ldots, x_{iN})'$ be the $i$th vector. Then

(18)  $\sqrt{N}\,\bar{x}_i = \sum_{\alpha=1}^{N} \dfrac{1}{\sqrt{N}}\,x_{i\alpha}$

is the distance from the origin of the projection of $y_i$ on the equiangular line
(with direction cosines $1/\sqrt{N}, \ldots, 1/\sqrt{N}$). The coordinates of the projection
are $(\bar{x}_i, \ldots, \bar{x}_i)$. Then $(x_{i1} - \bar{x}_i, \ldots, x_{iN} - \bar{x}_i)$ is the projection of $y_i$ on the
plane through the origin perpendicular to the equiangular line. The
numerator of $\lambda^{2/N}$ is the square of the $p$-dimensional volume of the parallelotope
with principal edges, the vectors $(x_{i1} - \bar{x}_i, \ldots, x_{iN} - \bar{x}_i)$. A point
$(x_{i1} - \mu_{0i}, \ldots, x_{iN} - \mu_{0i})$ is obtained from $y_i$ by translation parallel to the
equiangular line (by a distance $\sqrt{N}\mu_{0i}$). The denominator of $\lambda^{2/N}$ is the square of the
volume of the parallelotope with principal edges these vectors. Then $\lambda^{2/N}$ is
the ratio of these squared volumes.


5.2.2. The Distribution of $T^2$

In this subsection we will find the distribution of $T^2$ under general
conditions, including the case when the null hypothesis is not true. Let $T^2 = Y'S^{-1}Y$,
where $Y$ is distributed according to $N(\nu, \Sigma)$ and $nS$ is distributed
independently as $\sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$ with $Z_1, \ldots, Z_n$ independent, each with distribution
$N(0, \Sigma)$. The $T^2$ defined in Section 5.2.1 is a special case of this with
$Y = \sqrt{N}(\bar{x} - \mu_0)$ and $\nu = \sqrt{N}(\mu - \mu_0)$ and $n = N - 1$. Let $D$ be a nonsingular
matrix such that $D\Sigma D' = I$, and define

(19)  $Y^* = DY$,  $S^* = DSD'$,  $\nu^* = D\nu$.

Then $T^2 = Y^{*\prime}S^{*-1}Y^*$ (by Lemma 5.2.1), where $Y^*$ is distributed according
to $N(\nu^*, I)$ and $nS^*$ is distributed independently as $\sum_{\alpha=1}^{n} Z_\alpha^* Z_\alpha^{*\prime} =
\sum_{\alpha=1}^{n} DZ_\alpha(DZ_\alpha)'$ with the $Z_\alpha^* = DZ_\alpha$ independent, each with distribution
$N(0, I)$. We note $\nu'\Sigma^{-1}\nu = \nu^{*\prime}(I)^{-1}\nu^* = \nu^{*\prime}\nu^*$ by Lemma 5.2.1.

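Let the first row of a $p \times p$ orthogonal matrix $Q$ be defined by

(20)  $q_{1i} = \dfrac{Y_i^*}{\sqrt{Y^{*\prime}Y^*}}$,  $i = 1, \ldots, p$;

this is permissible because $\sum_{i=1}^{p} q_{1i}^2 = 1$. The other $p - 1$ rows can be defined
by some arbitrary rule (Lemma A.4.2 of the Appendix). Since $Q$ depends on
$Y^*$, it is a random matrix. Now let

(21)  $U = QY^*$,  $B = Q\,nS^*Q'$.

From the way $Q$ was defined,

(22)  $U_1 = \sum_i q_{1i}Y_i^* = \sqrt{Y^{*\prime}Y^*}$,
      $U_j = \sum_i q_{ji}Y_i^* = \sqrt{Y^{*\prime}Y^*}\,\sum_i q_{ji}q_{1i} = 0$,  $j \ne 1$.

Then

(23)  $\dfrac{T^2}{n} = U'B^{-1}U = (U_1, 0, \ldots, 0) \begin{pmatrix} b^{11} & b^{12} & \cdots & b^{1p} \\ b^{21} & b^{22} & \cdots & b^{2p} \\ \vdots & \vdots & & \vdots \\ b^{p1} & b^{p2} & \cdots & b^{pp} \end{pmatrix} \begin{pmatrix} U_1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} = U_1^2\, b^{11}$,

where $(b^{ij}) = B^{-1}$. By Theorem A.3.3 of the Appendix, $1/b^{11} = b_{11} -
b_{(1)}'B_{22}^{-1}b_{(1)} = b_{11\cdot 2,\ldots,p}$, where

(24)  $B = \begin{pmatrix} b_{11} & b_{(1)}' \\ b_{(1)} & B_{22} \end{pmatrix}$,

and $T^2/n = U_1^2/b_{11\cdot 2,\ldots,p} = Y^{*\prime}Y^*/b_{11\cdot 2,\ldots,p}$. The conditional distribution of
$B$ given $Q$ is that of $\sum_{\alpha=1}^{n} V_\alpha V_\alpha'$, where conditionally the $V_\alpha = QZ_\alpha^*$ are
independent, each with distribution $N(0, I)$. By Theorem 4.3.3 $b_{11\cdot 2,\ldots,p}$ is
conditionally distributed as $\sum_{\alpha=1}^{n-(p-1)} W_\alpha^2$, where conditionally the $W_\alpha$ are
independent, each with the distribution $N(0, 1)$; that is, $b_{11\cdot 2,\ldots,p}$ is
conditionally distributed as $\chi^2$ with $n - (p - 1)$ degrees of freedom. Since the
conditional distribution of $b_{11\cdot 2,\ldots,p}$ does not depend on $Q$, it is
unconditionally distributed as $\chi^2$. The quantity $Y^{*\prime}Y^*$ has a noncentral $\chi^2$-distribution
with $p$ degrees of freedom and noncentrality parameter $\nu^{*\prime}\nu^* = \nu'\Sigma^{-1}\nu$.
Then $T^2/n$ is distributed as the ratio of a noncentral $\chi^2$ and an independent
$\chi^2$.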

Theorem 5.2.2. Let $T^2 = Y'S^{-1}Y$, where $Y$ is distributed according to
$N(\nu, \Sigma)$ and $nS$ is independently distributed as $\sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$ with $Z_1, \ldots, Z_n$
independent, each with distribution $N(0, \Sigma)$. Then $(T^2/n)[(n - p + 1)/p]$ is
distributed as a noncentral $F$ with $p$ and $n - p + 1$ degrees of freedom and
noncentrality parameter $\nu'\Sigma^{-1}\nu$. If $\nu = 0$, the distribution is central $F$.

We shall call this the $T^2$-distribution with $n$ degrees of freedom.

Corollary 5.2.1. Let $x_1, \ldots, x_N$ be a sample from $N(\mu, \Sigma)$, and let $T^2 =
N(\bar{x} - \mu_0)'S^{-1}(\bar{x} - \mu_0)$. The distribution of $[T^2/(N - 1)][(N - p)/p]$ is
noncentral $F$ with $p$ and $N - p$ degrees of freedom and noncentrality parameter
$N(\mu - \mu_0)'\Sigma^{-1}(\mu - \mu_0)$. If $\mu = \mu_0$, then the $F$-distribution is central.
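A small simulation gives a rough check of Corollary 5.2.1 in the central case. The sketch below assumes NumPy; the sample size, dimension, number of replications, and seed are arbitrary choices. It compares the average of $[T^2/(N-1)][(N-p)/p]$ over many normal samples with the mean $(N-p)/(N-p-2)$ of a central $F$ with $p$ and $N-p$ degrees of freedom, which is a standard property of the $F$-distribution rather than a result stated in this section.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, reps = 20, 3, 20000

# reps independent samples of size N from N(0, I); the null value is mu0 = 0.
X = rng.standard_normal((reps, N, p))
xbar = X.mean(axis=1)                                   # shape (reps, p)
R = X - xbar[:, None, :]                                # centered observations
S = np.einsum('rni,rnj->rij', R, R) / (N - 1)           # shape (reps, p, p)
b = np.linalg.solve(S, xbar[:, :, None])[:, :, 0]       # S^{-1} xbar per sample
T2 = N * np.sum(xbar * b, axis=1)

F = T2 / (N - 1) * (N - p) / p                          # scaled as in the corollary
simulated_mean = F.mean()
theoretical_mean = (N - p) / (N - p - 2)                # mean of central F(p, N - p)
```

A comparison of means is only a coarse diagnostic; comparing empirical quantiles of `F` with tabulated $F$ points would be a sharper check.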

The above derivation of the $T^2$-distribution is due to Bowker (1960). The
noncentral $F$-density and tables of the distribution are discussed in Section
5.4.

For large samples the distribution of $T^2$ given by Corollary 5.2.1 is
approximately valid even if the parent distribution is not normal; in this sense
the $T^2$-test is a robust procedure.

Theorem 5.2.3. Let $\{X_\alpha\}$, $\alpha = 1, 2, \ldots$, be a sequence of independently
identically distributed random vectors with mean vector $\mu$ and covariance matrix
$\Sigma$; let $\bar{X}_N = (1/N)\sum_{\alpha=1}^{N} X_\alpha$, $S_N = [1/(N - 1)]\sum_{\alpha=1}^{N}(X_\alpha - \bar{X}_N)(X_\alpha - \bar{X}_N)'$, and
$T_N^2 = N(\bar{X}_N - \mu_0)'S_N^{-1}(\bar{X}_N - \mu_0)$. Then the limiting distribution of $T_N^2$ as
$N \to \infty$ is the $\chi^2$-distribution with $p$ degrees of freedom if $\mu = \mu_0$.

Proof. By the central limit theorem (Theorem 4.2.3) the limiting
distribution of $\sqrt{N}(\bar{X}_N - \mu)$ is $N(0, \Sigma)$. The sample covariance matrix converges
stochastically to $\Sigma$. Then the limiting distribution of $T_N^2$ is the distribution of
$Y'\Sigma^{-1}Y$, where $Y$ has the distribution $N(0, \Sigma)$. The theorem follows from
Theorem 3.3.3. ■
When the null hypothesis is true, $T^2/n$ is distributed as $\chi_p^2/\chi_{n-p+1}^2$, and
$\lambda^{2/N}$ given by (10) has the distribution of $\chi_{n-p+1}^2/(\chi_{n-p+1}^2 + \chi_p^2)$. The
density of $V = \chi_a^2/(\chi_a^2 + \chi_b^2)$, when $\chi_a^2$ and $\chi_b^2$ are independent, is

(25)  $\dfrac{\Gamma[\frac{1}{2}(a + b)]}{\Gamma(\frac{1}{2}a)\Gamma(\frac{1}{2}b)}\,v^{\frac{1}{2}a - 1}(1 - v)^{\frac{1}{2}b - 1}$;

this is the density of the beta distribution with parameters $\frac{1}{2}a$ and $\frac{1}{2}b$
(Problem 5.27). Thus the distribution of $\lambda^{2/N} = (1 + T^2/n)^{-1}$ is the beta
distribution with parameters $\frac{1}{2}(n - p + 1)$ and $\frac{1}{2}p$.

5.3. USES OF THE $T^2$-STATISTIC

5.3.1. Testing the Hypothesis That the Mean Vector Is a Given Vector

The likelihood ratio test of the hypothesis $\mu = \mu_0$ on the basis of a sample of
$N$ from $N(\mu, \Sigma)$ is equivalent to

(1)  $T^2 \ge T_0^2$

as given in Section 5.2.1. If the significance level is $\alpha$, then the $100\alpha\%$ point
of the $F$-distribution is taken, that is,

(2)  $T_0^2 = \dfrac{(N - 1)p}{N - p}\,F_{p,N-p}(\alpha) = T_{p,N-1}^2(\alpha)$,

say. The choice of significance level may depend on the power of the test. We
shall discuss this in Section 5.4.
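The conversion in (2) from a tabulated $F$ point to the critical value $T_0^2$ is simple arithmetic. The sketch below uses only the Python standard library; the function names are illustrative, and the $F$ point `f_crit` must be supplied by the reader from tables or software.

```python
def t2_critical_value(f_crit, N, p):
    """T_0^2 = (N - 1) p / (N - p) * F_{p, N-p}(alpha), as in equation (2)."""
    if N <= p:
        raise ValueError("need N > p")
    return (N - 1) * p / (N - p) * f_crit

def reject(t2, f_crit, N, p):
    """True when T^2 >= T_0^2, the critical region (1)."""
    return t2 >= t2_critical_value(f_crit, N, p)

# 3.0 is a placeholder value, not a tabulated F point.
example_t0_sq = t2_critical_value(3.0, N=21, p=5)   # (20 * 5 / 16) * 3.0
```

The multiplier $(N-1)p/(N-p)$ grows quickly as $p$ approaches $N$, which is one way to see why the test loses sensitivity when the dimension is close to the sample size.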

The statistic $T^2$ is computed from $\bar{x}$ and $A$. The vector $A^{-1}(\bar{x} - \mu_0) = b$
is the solution of $Ab = \bar{x} - \mu_0$. Then $T^2/(N - 1) = N(\bar{x} - \mu_0)'b$.
Note that $T^2/(N - 1)$ is the nonzero root of

(3)  $|N(\bar{x} - \mu_0)(\bar{x} - \mu_0)' - \lambda A| = 0$.

Lemma 5.3.1. If $v$ is a vector of $p$ components and if $B$ is a nonsingular
$p \times p$ matrix, then $v'B^{-1}v$ is the nonzero root of

(4)  $|vv' - \lambda B| = 0$.

Proof. The nonzero root, say $\lambda_1$, of (4) is associated with a characteristic
vector $\beta$ satisfying

(5)  $vv'\beta = \lambda_1 B\beta$.

Since $\lambda_1 \ne 0$, $v'\beta \ne 0$. Multiplying on the left by $v'B^{-1}$, we obtain

(6)  $v'B^{-1}v\,(v'\beta) = \lambda_1(v'\beta)$. ■

In the case above $v = \sqrt{N}(\bar{x} - \mu_0)$ and $B = A$.

Figure 5.1. A confidence ellipse.
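Lemma 5.3.1 can be illustrated numerically: the roots of $|vv' - \lambda B| = 0$ are the eigenvalues of $B^{-1}vv'$, a matrix of rank one. The sketch below assumes NumPy; the vector `v` and the positive definite matrix `B` are randomly generated illustrations.

```python
import numpy as np

rng = np.random.default_rng(4)
p = 4
v = rng.standard_normal(p)
G = rng.standard_normal((p, p))
B = G @ G.T + p * np.eye(p)              # positive definite, hence nonsingular

# Roots of |vv' - lambda B| = 0 are the eigenvalues of B^{-1} v v'.
roots = np.linalg.eigvals(np.linalg.solve(B, np.outer(v, v)))
nonzero_root = roots.real[np.argmax(np.abs(roots))]
quadratic_form = v @ np.linalg.solve(B, v)
```

Here `nonzero_root` matches `quadratic_form`, and the remaining roots are zero up to rounding, in line with the rank-one structure used in the proof.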

5.3.2. A Confidence Region for the Mean Vector

If $\mu$ is the mean of $N(\mu, \Sigma)$, the probability is $1 - \alpha$ of drawing a sample of
$N$ with mean $\bar{x}$ and covariance matrix $S$ such that

(7)  $N(\bar{x} - \mu)'S^{-1}(\bar{x} - \mu) \le T_{p,N-1}^2(\alpha)$.

Thus, if we compute (7) for a particular sample, we have confidence $1 - \alpha$
that (7) is a true statement concerning $\mu$. The inequality

(8)  $N(\bar{x} - m)'S^{-1}(\bar{x} - m) \le T_{p,N-1}^2(\alpha)$

is the interior and boundary of an ellipsoid in the $p$-dimensional space of $m$
with center at $\bar{x}$ and with size and shape depending on $S^{-1}$ and $\alpha$. See
Figure 5.1. We state that $\mu$ lies within this ellipsoid with confidence $1 - \alpha$.
Over random samples (8) is a random ellipsoid.
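Checking whether a candidate vector $m$ lies in the region (8) is a single quadratic-form evaluation. The sketch below assumes NumPy; the function name `in_confidence_region`, the summary statistics, and the critical value are hypothetical and serve only to show the calculation.

```python
import numpy as np

def in_confidence_region(m, xbar, S, N, t2_crit):
    """True when N (xbar - m)' S^{-1} (xbar - m) <= t2_crit, inequality (8)."""
    d = np.asarray(xbar, dtype=float) - np.asarray(m, dtype=float)
    quad = N * d @ np.linalg.solve(np.asarray(S, dtype=float), d)
    return bool(quad <= t2_crit)

# Hypothetical summary statistics and critical value, for illustration only.
xbar = np.array([1.0, 2.0])
S = np.array([[2.0, 0.5], [0.5, 1.0]])
N, t2_crit = 25, 7.0

center_inside = in_confidence_region(xbar, xbar, S, N, t2_crit)
far_point_inside = in_confidence_region([10.0, 10.0], xbar, S, N, t2_crit)
```

The center $\bar{x}$ always satisfies (8), while a point far from $\bar{x}$ relative to the spread described by $S/N$ does not.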


5.3.3. Simultaneous Confidence Intervals for All Linear Combinations
of the Mean Vector

From the confidence region (8) for $\mu$ we can obtain confidence intervals for
linear functions $\gamma'\mu$ that hold simultaneously with a given confidence
coefficient.

Lemma 5.3.2 (Generalized Cauchy-Schwarz Inequality). For a positive
definite matrix $S$,

(9)  $(\gamma'y)^2 \le \gamma'S\gamma\; y'S^{-1}y$.
Proof. Let $b = \gamma'y/\gamma'S\gamma$. Then

(10)  $0 \le (y - bS\gamma)'S^{-1}(y - bS\gamma)$
      $= y'S^{-1}y - b\gamma'SS^{-1}y - y'S^{-1}S\gamma b + b^2\gamma'SS^{-1}S\gamma$
      $= y'S^{-1}y - \dfrac{(\gamma'y)^2}{\gamma'S\gamma}$,

which yields (9). ■


When $y = \bar{x} - \mu$, then (9) implies that

(11)  $|\gamma'(\bar{x} - \mu)| \le \sqrt{\gamma'S\gamma}\,\sqrt{T_{p,N-1}^2(\alpha)/N}$

holds for all $\gamma$ with probability $1 - \alpha$. Thus we can assert with confidence
$1 - \alpha$ that the unknown parameter vector satisfies simultaneously for all $\gamma$
the inequalities

(12)  $|\gamma'\bar{x} - \gamma'm| \le \sqrt{\gamma'S\gamma}\,\sqrt{T_{p,N-1}^2(\alpha)/N}$.

The confidence region (8) can be explored by setting $\gamma$ in (12) equal to
simple vectors such as $(1, 0, \ldots, 0)'$ to obtain $m_1$, $(1, -1, 0, \ldots, 0)'$ to yield
$m_1 - m_2$, and so on. It should be noted that if only one linear function $\gamma'\mu$
were of interest, $\sqrt{T_{p,n}^2(\alpha)} = \sqrt{npF_{p,n-p+1}(\alpha)/(n - p + 1)}$ would be
replaced by $t_n(\alpha)$.


5.3.4. Two-Sample Problems

Another situation in which the $T^2$-statistic is used is one in which the null
hypothesis is that the mean of one normal population is equal to the mean of
the other where the covariance matrices are assumed equal but unknown.
Suppose $y_1^{(i)}, \ldots, y_{N_i}^{(i)}$ is a sample from $N(\mu^{(i)}, \Sigma)$, $i = 1, 2$. We wish to test the
null hypothesis $\mu^{(1)} = \mu^{(2)}$. The vector $\bar{y}^{(i)}$ is distributed according to
$N[\mu^{(i)}, (1/N_i)\Sigma]$. Consequently $\sqrt{N_1N_2/(N_1 + N_2)}\,(\bar{y}^{(1)} - \bar{y}^{(2)})$ is distributed
according to $N(0, \Sigma)$ under the null hypothesis. If we let

(13)  $S = \dfrac{1}{N_1 + N_2 - 2}\left[\sum_{\alpha=1}^{N_1}(y_\alpha^{(1)} - \bar{y}^{(1)})(y_\alpha^{(1)} - \bar{y}^{(1)})' + \sum_{\alpha=1}^{N_2}(y_\alpha^{(2)} - \bar{y}^{(2)})(y_\alpha^{(2)} - \bar{y}^{(2)})'\right]$,

then $(N_1 + N_2 - 2)S$ is distributed as $\sum_{\alpha=1}^{N_1+N_2-2} Z_\alpha Z_\alpha'$, where $Z_\alpha$ is distributed
according to $N(0, \Sigma)$. Thus

(14)  $T^2 = \dfrac{N_1N_2}{N_1 + N_2}\,(\bar{y}^{(1)} - \bar{y}^{(2)})'S^{-1}(\bar{y}^{(1)} - \bar{y}^{(2)})$

is distributed as $T^2$ with $N_1 + N_2 - 2$ degrees of freedom. The critical region
is

(15)  $T^2 > \dfrac{(N_1 + N_2 - 2)p}{N_1 + N_2 - p - 1}\,F_{p,N_1+N_2-p-1}(\alpha)$

with significance level $\alpha$.
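The two-sample statistic of (13) and (14) can be computed directly from the two data matrices. The sketch below assumes NumPy; the function name `two_sample_t2` and the tiny data sets are illustrative and not taken from the text.

```python
import numpy as np

def two_sample_t2(Y1, Y2):
    """Two-sample T^2 of (14), using the pooled covariance matrix S of (13)."""
    Y1 = np.asarray(Y1, dtype=float)
    Y2 = np.asarray(Y2, dtype=float)
    N1, N2 = Y1.shape[0], Y2.shape[0]
    R1 = Y1 - Y1.mean(axis=0)            # deviations from the first sample mean
    R2 = Y2 - Y2.mean(axis=0)            # deviations from the second sample mean
    S = (R1.T @ R1 + R2.T @ R2) / (N1 + N2 - 2)
    d = Y1.mean(axis=0) - Y2.mean(axis=0)
    return float(N1 * N2 / (N1 + N2) * d @ np.linalg.solve(S, d))

# Illustrative univariate samples: T^2 reduces to the squared pooled t statistic.
t2_univariate = two_sample_t2([[1.0], [2.0], [3.0]], [[2.0], [4.0], [6.0]])
```

For these samples the pooled variance is $10/4 = 2.5$ and the mean difference is $-2$, so the value is $(9/6)(4/2.5) = 2.4$.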

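A confidence region for $\mu^{(1)} - \mu^{(2)}$ with confidence level $1 - \alpha$ is the set of
vectors $m$ satisfying

(16)  $(\bar{y}^{(1)} - \bar{y}^{(2)} - m)'S^{-1}(\bar{y}^{(1)} - \bar{y}^{(2)} - m) \le \dfrac{N_1 + N_2}{N_1N_2}\,T_{p,N_1+N_2-2}^2(\alpha)$.

Simultaneous confidence intervals are

(17)  $|\gamma'(\bar{y}^{(1)} - \bar{y}^{(2)}) - \gamma'm| \le \sqrt{\gamma'S\gamma}\,\sqrt{\dfrac{N_1 + N_2}{N_1N_2}\,T_{p,N_1+N_2-2}^2(\alpha)}$.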

An example may be taken from Fisher (1936). Let $x_1$ = sepal length,
$x_2$ = sepal width, $x_3$ = petal length, $x_4$ = petal width. Fifty observations are
taken from the population Iris versicolor (1) and 50 from the population Iris
setosa (2). See Table 3.4. The data may be summarized (in centimeters) as

(18)  $\bar{x}^{(1)} = (5.936 \quad 2.770 \quad 4.260 \quad 1.326)'$,

(19)  $\bar{x}^{(2)} = (5.006 \quad 3.428 \quad 1.462 \quad 0.246)'$,

(20)  $98S = \begin{pmatrix} 19.1434 & 9.0356 & 9.7634 & 3.2394 \\ 9.0356 & 11.8658 & 4.6232 & 2.4746 \\ 9.7634 & 4.6232 & 12.2978 & 3.8794 \\ 3.2394 & 2.4746 & 3.8794 & 2.4604 \end{pmatrix}$.

The value of $T^2/98$ is 26.334, and $(T^2/98) \times \frac{95}{4} = 625.5$. This value is highly
significant compared to the $F$-value for 4 and 95 degrees of freedom of 3.52
at the 0.01 significance level.

Simultaneous confidence intervals for the differences of component means
$\mu_i^{(1)} - \mu_i^{(2)}$, $i = 1, 2, 3, 4$, are $0.930 \pm 0.337$, $-0.658 \pm 0.265$, $2.798 \pm 0.270$,
and $1.080 \pm 0.121$. In each case 0 does not lie in the interval. [Since $t_{98}(.01) <
T_{4,98}(.01)$, a univariate test on any component would lead to rejection of the
null hypothesis.] The last two components show the most significant
differences from 0.
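The figures quoted in this example can be reproduced from the summary statistics alone. The sketch below assumes NumPy and takes the mean differences, the matrix $98S$, and the $F$ point 3.52 from the text; because those inputs are rounded, the results agree with the quoted values only to about three significant figures.

```python
import numpy as np

N1 = N2 = 50
p = 4
d = np.array([0.930, -0.658, 2.798, 1.080])      # differences of component means
S98 = np.array([[19.1434,  9.0356,  9.7634, 3.2394],
                [ 9.0356, 11.8658,  4.6232, 2.4746],
                [ 9.7634,  4.6232, 12.2978, 3.8794],
                [ 3.2394,  2.4746,  3.8794, 2.4604]])   # 98 S from (20)

# T^2 / 98 = [N1 N2 / (N1 + N2)] d' (98 S)^{-1} d, from (14).
t2_over_98 = N1 * N2 / (N1 + N2) * d @ np.linalg.solve(S98, d)
f_stat = t2_over_98 * (N1 + N2 - p - 1) / p

# Half-widths of the simultaneous intervals (17) for the four components.
f_crit = 3.52                                    # F point for 4 and 95 d.f. quoted in the text
t2_crit = (N1 + N2 - 2) * p / (N1 + N2 - p - 1) * f_crit
S = S98 / 98
half_widths = np.sqrt(np.diag(S) * (N1 + N2) / (N1 * N2) * t2_crit)
```

Taking $\gamma$ to be a coordinate vector makes $\gamma'S\gamma$ a diagonal element of $S$, which is why only `np.diag(S)` enters the half-widths.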


5.3.5. A Problem of Several Samples

After considering the above example, Fisher considers a third sample drawn
from a population assumed to have the same covariance matrix. He treats the
same measurements on 50 Iris virginica (Table 3.4). There is a theoretical
reason for believing the gene structures of these three species to be such that
the mean vectors of the three populations are related as

(21)  $3\mu^{(1)} = \mu^{(3)} + 2\mu^{(2)}$,

where $\mu^{(3)}$ is the mean vector of the third population.
This is a special case of the following general problem. Let a = 

N" 1，…， be samples from M|x (,) , 2), /= I, … ， g ， respectively. 
Let us test the hypothesis 

( 22 ) 

where … ， (3 q are given scalars and |x is a given vector. The criterion i 、 


(23) 


T 2 =c 


⑴ -|X s - 1 

J 


i 


where 
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This 7** has the redistribution with q degrees of freedom* 

Fisher actually assumes in his example that the covariance matrices of the 
three populations may be different. Hence he uses the technique described in 
Section 5.5* 
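A minimal sketch of the several-sample criterion with a common covariance matrix may help fix the roles of $c$ and the pooled $S$. The function name `several_sample_t2` and the small data arrays are hypothetical and serve only as an illustration; with $q = 2$, $\beta = (1, -1)$, and $\mu = 0$ the statistic should reduce to the usual two-sample $T^2$.

```python
import numpy as np

def several_sample_t2(samples, betas, mu):
    """T^2 for H: sum_i beta_i mu^(i) = mu, assuming a common covariance matrix."""
    samples = [np.asarray(x, dtype=float) for x in samples]
    sizes = [x.shape[0] for x in samples]
    means = [x.mean(axis=0) for x in samples]
    # Pooled covariance: within-sample sums of products over (sum N_i - q).
    dof = sum(sizes) - len(samples)
    pooled = sum((x - m).T @ (x - m) for x, m in zip(samples, means)) / dof
    # 1/c = sum_i beta_i^2 / N_i
    c = 1.0 / sum(b ** 2 / n for b, n in zip(betas, sizes))
    diff = sum(b * m for b, m in zip(betas, means)) - np.asarray(mu, dtype=float)
    return c * diff @ np.linalg.solve(pooled, diff), dof

# Hypothetical illustrative data (p = 2), not taken from the text.
x1 = [[1.0, 2.0], [2.0, 1.5], [3.0, 3.5], [2.5, 2.0]]
x2 = [[0.0, 1.0], [1.0, 0.5], [1.5, 2.0], [0.5, 1.5], [2.0, 1.0]]
t2, dof = several_sample_t2([x1, x2], [1.0, -1.0], [0.0, 0.0])
print(t2, dof)
```

For Fisher's relation (21) one would pass three samples with `betas = [3.0, -2.0, -1.0]` and `mu` equal to the zero vector.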

5.3.6. A Problem of Symmetry

Consider testing the hypothesis $H: \mu_1 = \mu_2 = \cdots = \mu_p$ on the basis of a
sample $x_1, \ldots, x_N$ from $N(\mu, \Sigma)$, where $\mu' = (\mu_1, \ldots, \mu_p)$. Let $C$ be any
$(p - 1) \times p$ matrix of rank $p - 1$ such that

(27)  $C\varepsilon = 0,$

where $\varepsilon' = (1, \ldots, 1)$. Then

(28)  $y_\alpha = C x_\alpha, \qquad \alpha = 1, \ldots, N,$

has mean $C\mu$ and covariance matrix $C \Sigma C'$. The hypothesis $H$ is $C\mu = 0$.
The statistic to be used is

(29)  $T^2 = N \bar{y}' S^{-1} \bar{y},$

where

(30)  $\bar{y} = \frac{1}{N} \sum_{\alpha=1}^{N} y_\alpha = C\bar{x},$

(31)  $S = \frac{1}{N-1} \sum_{\alpha=1}^{N} (y_\alpha - \bar{y})(y_\alpha - \bar{y})' = \frac{1}{N-1}\, C \sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})' C'.$

This statistic has the $T^2$-distribution with $N - 1$ degrees of freedom for a
$(p - 1)$-dimensional distribution. This $T^2$-statistic is invariant under any
linear transformation in the $p - 1$ dimensions orthogonal to $\varepsilon$. Hence the
statistic is independent of the choice of $C$.
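The independence of the statistic from the choice of $C$ can be checked numerically. The following is a sketch only: the data are made-up illustrative values, and `symmetry_t2` is a hypothetical helper name. Two different contrast matrices, each of rank $p - 1$ with rows summing to zero, should give the same $T^2$ up to rounding.

```python
import numpy as np

def symmetry_t2(x, c):
    """T^2 = N ybar' S^{-1} ybar for y_alpha = C x_alpha, following (29)-(31)."""
    y = np.asarray(x, dtype=float) @ np.asarray(c, dtype=float).T
    n_obs = y.shape[0]
    ybar = y.mean(axis=0)
    s = np.cov(y, rowvar=False)  # divisor N - 1, as in (31)
    return n_obs * ybar @ np.linalg.solve(s, ybar)

# Hypothetical data: 8 observations on p = 4 components (illustrative values only).
x = np.array([
    [50, 45, 52, 47], [61, 58, 60, 55], [43, 40, 47, 41], [55, 49, 51, 50],
    [38, 36, 41, 33], [66, 60, 69, 62], [47, 46, 44, 45], [58, 51, 57, 49],
], dtype=float)

# Two choices of C with C @ ones = 0 and rank p - 1.
c_successive = np.array([[1, -1, 0, 0], [0, 1, -1, 0], [0, 0, 1, -1]])
c_contrasts = np.array([[1, -1, 1, -1], [0, 0, 1, -1], [1, 0, -1, 0]])

t2_a = symmetry_t2(x, c_successive)
t2_b = symmetry_t2(x, c_contrasts)
print(t2_a, t2_b)
```

To refer the statistic to an F table one would use $[T^2/(N-1)]\,[(N - p + 1)/(p - 1)]$ with $p - 1$ and $N - p + 1$ degrees of freedom.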

An example of this sort has been given by Rao (1948b). Let N be the
amount of cork in a boring from the north into a cork tree; let E, S, and W
be defined similarly. The set of amounts in four borings on one tree is
considered as an observation from a 4-variate normal distribution. The
question is whether the cork trees have the same amount of cork on each
side. We make a transformation

(32)  $y_1 = N - E - W + S, \qquad y_2 = S - W, \qquad y_3 = N - S.$

The number of observations is 28. The vector of means is

(33)  $\bar{y} = (8.86,\; 4.50,\; 0.86)';$

the covariance matrix for $y$ is

(34)  $S = \begin{pmatrix} 128.72 & 61.41 & -21.02 \\ 61.41 & 56.93 & -28.30 \\ -21.02 & -28.30 & 63.53 \end{pmatrix}.$

The value of $T^2/(N - 1)$ is 0.768. The statistic $0.768 \times 25/3 = 6.402$ is to be
compared with the F-significance point with 3 and 25 degrees of freedom. It
is significant at the 1% level.

5.3.7. Improved Estimation of the Mean

In Section 3.5 we considered estimation of the mean when the covariance
matrix was known and showed that the Stein-type estimation based on this
knowledge yielded lower quadratic risks than did the sample mean. In
particular, if the loss is $(m - \mu)' \Sigma^{-1} (m - \mu)$, then

(35)  $\left( 1 - \frac{p - 2}{N(\bar{x} - \nu)' \Sigma^{-1} (\bar{x} - \nu)} \right) (\bar{x} - \nu) + \nu$

is a minimax estimator of $\mu$ for any $\nu$ and has a smaller risk than $\bar{x}$ when
$p \ge 3$. When $\Sigma$ is unknown, we consider replacing it by an estimator, namely,
a multiple of $A = nS$.

Theorem 5.3.1. When the loss is $(m - \mu)' \Sigma^{-1} (m - \mu)$, the estimator for
$p \ge 3$ given by

(36)  $\left( 1 - \frac{a}{N(\bar{x} - \nu)' A^{-1} (\bar{x} - \nu)} \right) (\bar{x} - \nu) + \nu$

has smaller risk than $\bar{x}$ and is minimax for $0 < a < 2(p - 2)/(n - p + 3)$, and
the risk is minimized for $a = (p - 2)/(n - p + 3)$.

Proof. As in the case when $\Sigma$ is known (Section 3.5.2), we can make a
transformation that carries $(1/N)\Sigma$ to $I$. Then the problem is to estimate $\mu$
based on $Y$ with the distribution $N(\mu, I)$ and $A = \sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$, where
$Z_1, \ldots, Z_n$ are independently distributed, each according to $N(0, I)$, and the
loss function is $(m - \mu)'(m - \mu)$. (We have dropped a factor of $N$.) The
difference in risks is

(37)  $\Delta R(\mu) = \mathcal{E}_\mu \left\{ \|Y - \mu\|^2 - \left\| \left( 1 - \frac{a}{(Y - \nu)' A^{-1} (Y - \nu)} \right)(Y - \nu) + \nu - \mu \right\|^2 \right\}$
      $= \mathcal{E}_\mu \left\{ \frac{2a}{(Y - \nu)' A^{-1} (Y - \nu)} \sum_{i=1}^{p} (Y_i - \nu_i)(Y_i - \mu_i) - \frac{a^2}{[(Y - \nu)' A^{-1} (Y - \nu)]^2} \|Y - \nu\|^2 \right\}.$

The proof of Theorem 5.2.2 shows that $(Y - \nu)' A^{-1} (Y - \nu)$ is distributed as
$\|Y - \nu\|^2 / \chi^2_{n-p+1}$, where the $\chi^2_{n-p+1}$ is independent of $Y$. Then the
difference in risks is

(38)  $\Delta R(\mu) = \mathcal{E}_\mu \left\{ \frac{2a \chi^2_{n-p+1}}{\|Y - \nu\|^2} \sum_{i=1}^{p} (Y_i - \nu_i)(Y_i - \mu_i) - \frac{a^2 (\chi^2_{n-p+1})^2}{\|Y - \nu\|^2} \right\}$
      $= \mathcal{E}_\mu \left\{ \frac{2a(p - 2)\chi^2_{n-p+1}}{\|Y - \nu\|^2} - \frac{a^2 (\chi^2_{n-p+1})^2}{\|Y - \nu\|^2} \right\}$
      $= \left\{ 2(p - 2)(n - p + 1)a - [2(n - p + 1) + (n - p + 1)^2] a^2 \right\} \mathcal{E}_\mu \frac{1}{\|Y - \nu\|^2}.$

The factor in braces is $n - p + 1$ times $2(p - 2)a - (n - p + 3)a^2$, which
is positive for $0 < a < 2(p - 2)/(n - p + 3)$ and is maximized for $a =
(p - 2)/(n - p + 3)$. ■

The improvement over the risk of $Y$ is $(n - p + 1)(p - 2)^2/(n - p + 3) \cdot
\mathcal{E}_\mu \|Y - \nu\|^{-2}$, as compared to the improvement $(p - 2)^2\, \mathcal{E}_\mu \|Y - \nu\|^{-2}$ of $m(y)$
of Section 3.5 when $\Sigma$ is known.





Corollary 5.3.1. The estimator for $p \ge 3$

(39)  $\left[ 1 - \frac{a}{N(\bar{x} - \nu)' A^{-1} (\bar{x} - \nu)} \right]^{+} (\bar{x} - \nu) + \nu,$

where $[\,\cdot\,]^{+}$ denotes the positive part, has smaller risk than (36) and is
minimax for $0 < a < 2(p - 2)/(n - p + 3)$.

Proof. This corollary follows from Theorem 5.3.1 and Lemma 3.5.2. ■

The risk of (39) is not necessarily minimized at $a = (p - 2)/(n - p + 3)$,
but that value seems like a good choice. This is the estimator (18) of Section
3.5 with $\Sigma$ replaced by $[1/(n - p + 3)]A$.

When the loss function is $(m - \mu)' Q (m - \mu)$, where $Q$ is an arbitrary
positive definite matrix, it is harder to present a uniformly improved
estimator that is attractive. The estimators of Section 3.5 can be used with $\Sigma$
replaced by an estimate.
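As an illustration of how estimators of the form (36) and (39) might be computed, here is a minimal sketch. The function name `shrinkage_mean`, its arguments, and the small data set are hypothetical; the default $a = (p - 2)/(n - p + 3)$ is the value discussed above, with $n = N - 1$ and $A$ the matrix of sums of squares and products about the sample mean.

```python
import numpy as np

def shrinkage_mean(x, nu, a=None, positive_part=True):
    """Stein-type estimate of the mean with unknown covariance, shrinking toward nu."""
    x = np.asarray(x, dtype=float)
    nu = np.asarray(nu, dtype=float)
    n_obs, p = x.shape
    n = n_obs - 1
    if p < 3:
        raise ValueError("the result quoted above requires p >= 3")
    if a is None:
        a = (p - 2) / (n - p + 3)
    xbar = x.mean(axis=0)
    centered = x - xbar
    a_mat = centered.T @ centered          # A = nS
    dev = xbar - nu
    quad = n_obs * dev @ np.linalg.solve(a_mat, dev)
    factor = 1.0 - a / quad
    if positive_part:                      # positive-part version
        factor = max(factor, 0.0)
    return factor * dev + nu

# Hypothetical data: N = 6 observations, p = 3.
x = np.array([[1, 2, 3], [2, 1, 4], [3, 3, 2], [4, 2, 5], [2, 4, 3], [3, 1, 1]], dtype=float)
nu = np.zeros(3)
estimate = shrinkage_mean(x, nu)
print(estimate)
```

With `positive_part=True` the estimate always lies on the segment joining $\nu$ and $\bar{x}$; setting `a=0.0` returns $\bar{x}$ itself.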


5.4. THE DISTRIBUTION OF $T^2$ UNDER ALTERNATIVE
HYPOTHESES; THE POWER FUNCTION

In Section 5.2.2 we showed that $(T^2/n)(N - p)/p$ has a noncentral F-distribution.
In this section we shall discuss the noncentral F-distribution, its
tabulation, and applications to procedures based on $T^2$.

The noncentral F-distribution is defined as the distribution of the ratio of
a noncentral $\chi^2$ and an independent $\chi^2$ divided by the ratio of corresponding
degrees of freedom. Let $V$ have the noncentral $\chi^2$-distribution with $p$
degrees of freedom and noncentrality parameter $\tau^2$ (as given in Theorem
3.3.5), and let $W$ be independently distributed as $\chi^2$ with $m$ degrees of
freedom. We shall find the density of $F = (V/p)/(W/m)$, which is the
noncentral F with noncentrality parameter $\tau^2$. The joint density of $V$ and $W$
is (28) of Section 3.3 multiplied by the density of $W$, which is
$2^{-m/2}\, \Gamma^{-1}(\frac{1}{2}m)\, w^{m/2-1} e^{-w/2}$. The joint density of $F$ and $W$ ($dv = pw\,df/m$) is

(1)  $\frac{p\, e^{-\tau^2/2}}{m\, 2^{(p+m)/2}\, \Gamma(\frac{1}{2}m)} \sum_{\beta=0}^{\infty} \frac{(\tau^2/4)^{\beta}\, (pf/m)^{p/2+\beta-1}\, w^{(p+m)/2+\beta-1}\, e^{-\frac{1}{2}w(1 + pf/m)}}{\beta!\, \Gamma(\frac{1}{2}p + \beta)}.$

The marginal density, obtained by integrating (1) with respect to $w$ from 0
to $\infty$, is

(2)  $\frac{p\, e^{-\tau^2/2}}{m\, \Gamma(\frac{1}{2}m)} \sum_{\beta=0}^{\infty} \frac{(\tau^2/2)^{\beta}\, (pf/m)^{p/2+\beta-1}\, \Gamma[\frac{1}{2}(p + m) + \beta]}{\beta!\, \Gamma(\frac{1}{2}p + \beta)\, (1 + pf/m)^{(p+m)/2+\beta}}.$

Theorem 5.4.1. If $V$ has a noncentral $\chi^2$-distribution with $p$ degrees of
freedom and noncentrality parameter $\tau^2$, and $W$ has an independent $\chi^2$-distribution
with $m$ degrees of freedom, then $F = (V/p)/(W/m)$ has the density (2).

The density (2) is the density of the noncentral F-distribution.

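If $T^2 = N(\bar{x} - \mu_0)' S^{-1} (\bar{x} - \mu_0)$ is based on a sample of $N$ from $N(\mu, \Sigma)$,
then $(T^2/n)(N - p)/p$ has the noncentral F-distribution with $p$ and $N - p$
degrees of freedom and noncentrality parameter $N(\mu - \mu_0)' \Sigma^{-1} (\mu - \mu_0) =
\tau^2$. From (2) we find that the density of $T^2$ is

(3)  $\frac{e^{-\tau^2/2}}{(N - 1)\, \Gamma[\frac{1}{2}(N - p)]} \sum_{\beta=0}^{\infty} \frac{(\tau^2/2)^{\beta}\, [t^2/(N - 1)]^{p/2+\beta-1}\, \Gamma(\frac{1}{2}N + \beta)}{\beta!\, \Gamma(\frac{1}{2}p + \beta)\, [1 + t^2/(N - 1)]^{N/2+\beta}}$
     $= \frac{e^{-\tau^2/2}\, [t^2/(N - 1)]^{p/2-1}\, \Gamma(\frac{1}{2}N)}{(N - 1)\, \Gamma[\frac{1}{2}(N - p)]\, \Gamma(\frac{1}{2}p)\, [1 + t^2/(N - 1)]^{N/2}}\; {}_1F_1\!\left( \tfrac{1}{2}N;\, \tfrac{1}{2}p;\, \frac{\tau^2\, t^2/(N - 1)}{2[1 + t^2/(N - 1)]} \right),$

where

(4)  ${}_1F_1(a; b; x) = \sum_{\beta=0}^{\infty} \frac{\Gamma(a + \beta)\, \Gamma(b)\, x^{\beta}}{\Gamma(a)\, \Gamma(b + \beta)\, \beta!}.$

The density (3) is the density of the noncentral $T^2$-distribution.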

Tables have been given by Tang (1938) of the probability of accepting the
null hypothesis (that is, the probability of Type II error) for various values of
$\tau^2$ and for significance levels 0.05 and 0.01. His number of degrees of
freedom $f_1$ is our $p$ [1(1)8], his $f_2$ is our $n - p + 1$ [2, 4(1)30, 60, $\infty$], and his
noncentrality parameter $\phi$ is related to our $\tau^2$ by

(5)  $\phi = \frac{\tau}{\sqrt{p + 1}}$

[1($\frac{1}{2}$)3(1)8]. His accompanying tables of significance points are for $T^2/(T^2 +
N - 1)$.

As an example, suppose $p = 4$, $n - p + 1 = 20$, and consider testing the
null hypothesis $\mu = 0$ at the 1% level of significance. We would like to know
the probability, say, that we accept the null hypothesis when $\phi = 2.5$ ($\tau^2 =
31.25$). It is 0.227. If we think the disadvantage of accepting the null
hypothesis when $N$, $\mu$, and $\Sigma$ are such that $\tau^2 = 31.25$ is less than the
disadvantage of rejecting the null hypothesis when it is true, then we may find it
reasonable to conduct the test as assumed. However, if the disadvantage of
one type of error is about equal to that of the other, it would seem
reasonable to bring down the probability of a Type II error. Thus, if we use a
significance level of 5%, the probability of Type II error (for $\phi = 2.5$) is only
0.043.

Lehmer (1944) has computed tables of $\phi$ for given significance level and
given probability of Type II error. Here tables can be used to see what value
of $\tau^2$ is needed to make the probability of acceptance of the null hypothesis
sufficiently low when $\mu \ne 0$. For instance, if we want to be able to reject the
hypothesis $\mu = 0$ on the basis of a sample for a given $\mu$ and $\Sigma$, we may be
able to choose $N$ so that $N\mu' \Sigma^{-1} \mu = \tau^2$ is sufficiently large. Of course, the
difficulty with these considerations is that we usually do not know exactly the
values of $\mu$ and $\Sigma$ (and hence of $\tau^2$) for which we want the probability of
rejection at a certain value.

The distribution of $T^2$ when the null hypothesis is not true was derived by
different methods by Hsu (1938) and Bose and Roy (1938).
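The bookkeeping between $\tau^2$, Tang's $\phi$, and the sample size can be sketched in a few lines. This is an illustration only: the function names are hypothetical, the values of $\mu$ and $\Sigma$ in the usage lines are invented, and the Type II error probabilities themselves still have to be read from tables (or obtained from noncentral F routines), which this sketch does not attempt.

```python
import math
import numpy as np

def tang_phi(tau2, p):
    """Tang's noncentrality parameter phi = tau / sqrt(p + 1), as in (5)."""
    return math.sqrt(tau2 / (p + 1))

def sample_size_for_tau2(mu, sigma, target_tau2):
    """Smallest N with N * mu' Sigma^{-1} mu >= target_tau2."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    delta2 = float(mu @ np.linalg.solve(sigma, mu))
    # The small offset guards against rounding when the ratio is an exact integer.
    return math.ceil(target_tau2 / delta2 - 1e-12)

# The example in the text: p = 4 and tau^2 = 31.25 correspond to phi = 2.5.
phi = tang_phi(31.25, 4)

# Hypothetical mu and Sigma (not from the text): mu' Sigma^{-1} mu = 1.25.
n_needed = sample_size_for_tau2([1.0, 0.5], np.eye(2), 31.25)
print(phi, n_needed)
```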


5.5. THE TWO-SAMPLE PROBLEM WITH UNEQUAL
COVARIANCE MATRICES

If the covariance matrices are not the same, the $T^2$-test for equality of mean
vectors has a probability of rejection under the null hypothesis that depends
on these matrices. If the difference between the matrices is small or if the
sample sizes are large, there is no practical effect. However, if the covariance
matrices are quite different and/or the sample sizes are relatively small, the
nominal significance level may be distorted. Hence we develop a procedure
with assigned significance level. Let $\{x_\alpha^{(i)}\}$, $\alpha = 1, \ldots, N_i$, be samples from
$N(\mu^{(i)}, \Sigma_i)$, $i = 1, 2$. We wish to test the hypothesis $H: \mu^{(1)} = \mu^{(2)}$. The mean
$\bar{x}^{(1)}$ of the first sample is normally distributed with expected value

(1)  $\mathcal{E}\bar{x}^{(1)} = \mu^{(1)}$

and covariance matrix

(2)  $\mathcal{E}(\bar{x}^{(1)} - \mu^{(1)})(\bar{x}^{(1)} - \mu^{(1)})' = \frac{1}{N_1} \Sigma_1.$

Similarly, the mean $\bar{x}^{(2)}$ of the second sample is normally distributed with
expected value

(3)  $\mathcal{E}\bar{x}^{(2)} = \mu^{(2)}$

and covariance matrix

(4)  $\mathcal{E}(\bar{x}^{(2)} - \mu^{(2)})(\bar{x}^{(2)} - \mu^{(2)})' = \frac{1}{N_2} \Sigma_2.$

Thus $\bar{x}^{(1)} - \bar{x}^{(2)}$ has mean $\mu^{(1)} - \mu^{(2)}$ and covariance matrix $(1/N_1)\Sigma_1 +
(1/N_2)\Sigma_2$. We cannot use the technique of Section 5.2, however, because

(5)  $\sum_{\alpha=1}^{N_1} (x_\alpha^{(1)} - \bar{x}^{(1)})(x_\alpha^{(1)} - \bar{x}^{(1)})' + \sum_{\alpha=1}^{N_2} (x_\alpha^{(2)} - \bar{x}^{(2)})(x_\alpha^{(2)} - \bar{x}^{(2)})'$

does not have the Wishart distribution with covariance matrix a multiple of
$(1/N_1)\Sigma_1 + (1/N_2)\Sigma_2$.
If $N_1 = N_2 = N$, say, we can use the $T^2$-test in an obvious way. Let
$y_\alpha = x_\alpha^{(1)} - x_\alpha^{(2)}$ (assuming the numbering of the observations in the two
samples is independent of the observations themselves). Then $y_\alpha$ is normally
distributed with mean $\mu^{(1)} - \mu^{(2)}$ and covariance matrix $\Sigma_1 + \Sigma_2$, and
$y_1, \ldots, y_N$ are independent. Let $\bar{y} = (1/N) \sum_{\alpha=1}^{N} y_\alpha = \bar{x}^{(1)} - \bar{x}^{(2)}$, and define $S$
by

(6)  $(N - 1)S = \sum_{\alpha=1}^{N} (y_\alpha - \bar{y})(y_\alpha - \bar{y})'$
     $= \sum_{\alpha=1}^{N} (x_\alpha^{(1)} - x_\alpha^{(2)} - \bar{x}^{(1)} + \bar{x}^{(2)})(x_\alpha^{(1)} - x_\alpha^{(2)} - \bar{x}^{(1)} + \bar{x}^{(2)})'.$

Then

(7)  $T^2 = N \bar{y}' S^{-1} \bar{y}$

is suitable for testing the hypothesis $\mu^{(1)} - \mu^{(2)} = 0$, and has the $T^2$-distribution
with $N - 1$ degrees of freedom. It should be observed that if we had
known $\Sigma_1 = \Sigma_2$, we would have used a $T^2$-statistic with $2N - 2$ degrees of
freedom; thus we have lost $N - 1$ degrees of freedom in constructing a test
which is independent of the two covariance matrices. If $N_1 = N_2 = 50$ as in
the example in Section 5.3.4, then $T^2_{4,49}(.01) = 15.93$ as compared to $T^2_{4,98}(.01)
= 14.52$.

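Now let us turn our attention to the case of $N_1 \ne N_2$. For convenience, let
$N_1 < N_2$. Then we define

(8)  $y_\alpha = x_\alpha^{(1)} - \sqrt{\frac{N_1}{N_2}}\, x_\alpha^{(2)} + \frac{1}{\sqrt{N_1 N_2}} \sum_{\beta=1}^{N_1} x_\beta^{(2)} - \frac{1}{N_2} \sum_{\gamma=1}^{N_2} x_\gamma^{(2)}, \qquad \alpha = 1, \ldots, N_1.$

The expected value of $y_\alpha$ is

(9)  $\mathcal{E} y_\alpha = \mu^{(1)} - \sqrt{\frac{N_1}{N_2}}\, \mu^{(2)} + \frac{N_1}{\sqrt{N_1 N_2}}\, \mu^{(2)} - \frac{N_2}{N_2}\, \mu^{(2)} = \mu^{(1)} - \mu^{(2)}.$

The covariance matrix of $y_\alpha$ and $y_\beta$ is

(10)  $\mathcal{E}(y_\alpha - \mathcal{E} y_\alpha)(y_\beta - \mathcal{E} y_\beta)' = \delta_{\alpha\beta} \left( \Sigma_1 + \frac{N_1}{N_2} \Sigma_2 \right).$

Thus a suitable statistic for testing $\mu^{(1)} - \mu^{(2)} = 0$, which has the $T^2$-distribution
with $N_1 - 1$ degrees of freedom, is

(11)  $T^2 = N_1 \bar{y}' S^{-1} \bar{y},$

where

(12)  $\bar{y} = \frac{1}{N_1} \sum_{\alpha=1}^{N_1} y_\alpha = \bar{x}^{(1)} - \bar{x}^{(2)}$

and

(13)  $(N_1 - 1)S = \sum_{\alpha=1}^{N_1} (y_\alpha - \bar{y})(y_\alpha - \bar{y})' = \sum_{\alpha=1}^{N_1} (u_\alpha - \bar{u})(u_\alpha - \bar{u})',$

where $\bar{u} = (1/N_1) \sum_{\alpha=1}^{N_1} u_\alpha$ and $u_\alpha = x_\alpha^{(1)} - \sqrt{N_1/N_2}\, x_\alpha^{(2)}$, $\alpha = 1, \ldots, N_1$.

This procedure was suggested by Scheffé (1943) in the univariate case.
Scheffé showed that in the univariate case this technique gives the shortest
confidence intervals obtained by using the t-distribution. The advantage of
the method is that $\bar{x}^{(1)} - \bar{x}^{(2)}$ is used, and this statistic is most relevant to
$\mu^{(1)} - \mu^{(2)}$. The sacrifice of observations in estimating a covariance matrix is
not so important. Bennett (1951) gave the extension of the procedure to the
multivariate case.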
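The construction in (8) through (13) is easy to mistype, so a small sketch may be useful. It is an illustration under assumptions: the function name `scheffe_bennett_t2` and the data are hypothetical, and the samples are assumed to be labeled so that $N_1 \le N_2$ with the order of observations unrelated to their values.

```python
import numpy as np

def scheffe_bennett_t2(x1, x2):
    """T^2 of (11) for two samples with possibly unequal covariance matrices."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n1, n2 = x1.shape[0], x2.shape[0]
    if n1 > n2:
        raise ValueError("label the samples so that N1 <= N2")
    # y_alpha of (8): the first N1 observations of sample 2 are paired with sample 1.
    y = (x1
         - np.sqrt(n1 / n2) * x2[:n1]
         + x2[:n1].sum(axis=0) / np.sqrt(n1 * n2)
         - x2.mean(axis=0))
    ybar = y.mean(axis=0)
    s = np.cov(y, rowvar=False)            # divisor N1 - 1, as in (13)
    t2 = n1 * ybar @ np.linalg.solve(s, ybar)
    return t2, ybar, s

# Hypothetical data (p = 2, N1 = 4, N2 = 6), for illustration only.
x1 = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 3.4], [5.5, 2.6]])
x2 = np.array([[4.0, 2.0], [4.5, 2.9], [3.8, 2.4], [5.0, 3.1], [4.2, 2.2], [4.7, 2.8]])
t2, ybar, s = scheffe_bennett_t2(x1, x2)
print(t2)
```

Under the null hypothesis $T^2 (N_1 - p)/[(N_1 - 1)p]$ would be referred to the F-distribution with $p$ and $N_1 - p$ degrees of freedom.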

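This approach can be used for more general cases. Let $\{x_\alpha^{(i)}\}$, $\alpha = 1, \ldots, N_i$,
$i = 1, \ldots, q$, be samples from $N(\mu^{(i)}, \Sigma_i)$, $i = 1, \ldots, q$, respectively. Consider
testing the hypothesis

(14)  $H: \sum_{i=1}^{q} \beta_i \mu^{(i)} = \mu,$

where $\beta_1, \ldots, \beta_q$ are given scalars and $\mu$ is a given vector. If the $N_i$ are
unequal, take $N_1$ to be the smallest. Let

(15)  $y_\alpha = \beta_1 x_\alpha^{(1)} + \sum_{i=2}^{q} \beta_i \sqrt{\frac{N_1}{N_i}} \left( x_\alpha^{(i)} - \frac{1}{N_1} \sum_{\beta=1}^{N_1} x_\beta^{(i)} + \frac{1}{\sqrt{N_1 N_i}} \sum_{\gamma=1}^{N_i} x_\gamma^{(i)} \right).$

Then $\mathcal{E} y_\alpha = \sum_{i=1}^{q} \beta_i \mu^{(i)}$ and

(16)  $\mathcal{E}(y_\alpha - \mathcal{E} y_\alpha)(y_\beta - \mathcal{E} y_\beta)' = \delta_{\alpha\beta} \sum_{i=1}^{q} \frac{N_1}{N_i} \beta_i^2 \Sigma_i.$

Let $\bar{y}$ and $S$ be defined by

(17)  $\bar{y} = \frac{1}{N_1} \sum_{\alpha=1}^{N_1} y_\alpha = \sum_{i=1}^{q} \beta_i \bar{x}^{(i)}, \qquad \bar{x}^{(i)} = \frac{1}{N_i} \sum_{\beta=1}^{N_i} x_\beta^{(i)},$

(18)  $(N_1 - 1)S = \sum_{\alpha=1}^{N_1} (y_\alpha - \bar{y})(y_\alpha - \bar{y})'.$

Then

(19)  $T^2 = N_1 (\bar{y} - \mu)' S^{-1} (\bar{y} - \mu)$

is suitable for testing $H$, and when the hypothesis is true, this statistic has the
$T^2$-distribution for dimension $p$ with $N_1 - 1$ degrees of freedom. If we let
$u_\alpha = \sum_{i=1}^{q} \beta_i \sqrt{N_1/N_i}\, x_\alpha^{(i)}$, $\alpha = 1, \ldots, N_1$, then $S$ can be defined as

(20)  $(N_1 - 1)S = \sum_{\alpha=1}^{N_1} (u_\alpha - \bar{u})(u_\alpha - \bar{u})'.$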


Another problem that is amenable to this kind of treatment is testing the
hypothesis that two subvectors have equal means. Let $x = (x^{(1)\prime}, x^{(2)\prime})'$ be
distributed normally with mean $\mu = (\mu^{(1)\prime}, \mu^{(2)\prime})'$ and covariance matrix

(21)  $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$

We assume that $x^{(1)}$ and $x^{(2)}$ are each of $q$ components. Then $y = x^{(1)} - x^{(2)}$
is distributed normally with mean $\mu^{(1)} - \mu^{(2)}$ and covariance matrix $\Sigma_y = \Sigma_{11}
- \Sigma_{21} - \Sigma_{12} + \Sigma_{22}$. To test the hypothesis $\mu^{(1)} = \mu^{(2)}$ we use a $T^2$-statistic
$N \bar{y}' S_y^{-1} \bar{y}$, where the mean vector and covariance matrix of the sample are
partitioned similarly to $\mu$ and $\Sigma$.


5.6. SOME OPTIMAL PROPERTIES OF THE $T^2$-TEST

5.6.1. Optimal Invariant Tests

In this section we shall indicate that the $T^2$-test is the best in certain classes
of tests and sketch briefly the proofs of these results.

The hypothesis $\mu = 0$ is to be tested on the basis of the $N$ observations
$x_1, \ldots, x_N$ from $N(\mu, \Sigma)$. First we consider the class of tests based on the
statistics $A = \sum (x_\alpha - \bar{x})(x_\alpha - \bar{x})'$ and $\bar{x}$ which are invariant with respect to
the transformations $A^* = CAC'$ and $\bar{x}^* = C\bar{x}$, where $C$ is nonsingular. The
transformation $x_\alpha^* = Cx_\alpha$ leaves the problem invariant; that is, in terms of $x_\alpha^*$
we test the hypothesis $\mathcal{E} x_\alpha^* = 0$ given that $x_1^*, \ldots, x_N^*$ are $N$ observations
from a multivariate normal population. It seems reasonable that we require a
solution that is also invariant with respect to these transformations; that is,
we look for a critical region that is not changed by a nonsingular linear
transformation. (The definition of the region is the same in different
coordinate systems.)

Theorem 5.6.1. Given the observations $x_1, \ldots, x_N$ from $N(\mu, \Sigma)$, of all
tests of $\mu = 0$ based on $\bar{x}$ and $A = \sum (x_\alpha - \bar{x})(x_\alpha - \bar{x})'$ that are invariant with
respect to transformations $\bar{x}^* = C\bar{x}$, $A^* = CAC'$ ($C$ nonsingular), the $T^2$-test is
uniformly most powerful.

Proof. First, as we have seen in Section 5.2.1, any test based on $T^2$ is
invariant. Second, this function is essentially the only invariant, for if $f(\bar{x}, A)$
is invariant, then $f(\bar{x}, A) = f(\bar{x}^*, I)$, where only the first coordinate of $\bar{x}^*$ is
different from zero and it is $\sqrt{\bar{x}' A^{-1} \bar{x}}$. (There is a matrix $C$ such that
$C\bar{x} = \bar{x}^*$ and $CAC' = I$.) Thus $f(\bar{x}, A)$ depends only on $\bar{x}' A^{-1} \bar{x}$. Thus an
invariant test must be based on $\bar{x}' A^{-1} \bar{x}$. Third, we can apply the Neyman-Pearson
fundamental lemma to the distribution of $T^2$ [(3) of Section 5.4] to
find the uniformly most powerful test based on $T^2$ against a simple
alternative $\tau^2 = N\mu' \Sigma^{-1} \mu$. The most powerful test of $\tau^2 = 0$ is based on the ratio of
(3) of Section 5.4 to (3) with $\tau^2 = 0$. The critical region is

(1)  $c \le \dfrac{\dfrac{e^{-\tau^2/2}}{n\, \Gamma[\frac{1}{2}(n - p + 1)]} \displaystyle\sum_{\alpha=0}^{\infty} \dfrac{(\tau^2/2)^{\alpha}\, (t^2/n)^{p/2+\alpha-1}\, \Gamma[\frac{1}{2}(n + 1) + \alpha]}{\alpha!\, \Gamma(\frac{1}{2}p + \alpha)\, (1 + t^2/n)^{(n+1)/2+\alpha}}}{\dfrac{(t^2/n)^{p/2-1}\, \Gamma[\frac{1}{2}(n + 1)]}{n\, \Gamma[\frac{1}{2}(n - p + 1)]\, \Gamma(\frac{1}{2}p)\, (1 + t^2/n)^{(n+1)/2}}}$
     $= \dfrac{\Gamma(\frac{1}{2}p)}{\Gamma[\frac{1}{2}(n + 1)]}\, e^{-\tau^2/2} \displaystyle\sum_{\alpha=0}^{\infty} \dfrac{(\tau^2/2)^{\alpha}\, \Gamma[\frac{1}{2}(n + 1) + \alpha]}{\alpha!\, \Gamma(\frac{1}{2}p + \alpha)} \left( \dfrac{t^2/n}{1 + t^2/n} \right)^{\alpha}.$

The right-hand side of (1) is a strictly increasing function of $(t^2/n)/(1 + t^2/n)$,
hence of $t^2$. Thus the inequality is equivalent to $t^2 > k$ for $k$ suitably chosen.
Since this does not depend on the alternative $\tau^2$, the test is uniformly most
powerful invariant. ■

Definition 5.6.1. A critical function $\psi(\bar{x}, A)$ is a function with values
between 0 and 1 (inclusive) such that $\mathcal{E}\psi(\bar{x}, A) = \varepsilon$, the significance level, when
$\mu = 0$.

A randomized test consists of rejecting the hypothesis with probability
$\psi(x, B)$ when $\bar{x} = x$ and $A = B$. A nonrandomized test is defined when
$\psi(\bar{x}, A)$ takes on only the values 0 and 1. Using the form of the
Neyman-Pearson lemma appropriate for critical functions, we obtain the
following corollary:

Corollary 5.6.1. On the basis of observations $x_1, \ldots, x_N$ from $N(\mu, \Sigma)$, of
all randomized tests based on $\bar{x}$ and $A$ that are invariant with respect to
transformations $\bar{x}^* = C\bar{x}$, $A^* = CAC'$ ($C$ nonsingular), the $T^2$-test is uniformly
most powerful.

Theorem 5.6.2. On the basis of observations $x_1, \ldots, x_N$ from $N(\mu, \Sigma)$, of
all tests of $\mu = 0$ that are invariant with respect to transformations $x_\alpha^* = Cx_\alpha$
($C$ nonsingular), the $T^2$-test is a uniformly most powerful test; that is, the $T^2$-test
is at least as powerful as any other invariant test.

Proof. Let $\psi(x_1, \ldots, x_N)$ be the critical function of an invariant test. Then

(2)  $\mathcal{E}[\psi(x_1, \ldots, x_N)] = \mathcal{E}\{\mathcal{E}[\psi(x_1, \ldots, x_N) \mid \bar{x}, A]\}.$

Since $\bar{x}$, $A$ are sufficient statistics for $\mu$, $\Sigma$, the expectation $\mathcal{E}[\psi(x_1, \ldots, x_N) \mid
\bar{x}, A]$ depends only on $\bar{x}$, $A$. It is invariant and has the same power as
$\psi(x_1, \ldots, x_N)$. Thus each test in this larger class can be replaced by one in
the smaller class (depending only on $\bar{x}$ and $A$) that has identical power.
Corollary 5.6.1 completes the proof. ■

Theorem 5.6.3. Given observations $x_1, \ldots, x_N$ from $N(\mu, \Sigma)$, of all tests of
$\mu = 0$ based on $\bar{x}$ and $A = \sum (x_\alpha - \bar{x})(x_\alpha - \bar{x})'$ with power depending only on
$N\mu' \Sigma^{-1} \mu$, the $T^2$-test is uniformly most powerful.

Proof. We wish to reduce this theorem to Theorem 5.6.1 by identifying the
class of tests with power depending on $N\mu' \Sigma^{-1} \mu$ with the class of invariant
tests. We need the following definition:

Definition 5.6.2. A test $\psi(x_1, \ldots, x_N)$ is said to be almost invariant if

(3)  $\psi(x_1, \ldots, x_N) = \psi(Cx_1, \ldots, Cx_N)$

for all $x_1, \ldots, x_N$ except for a set of $x_1, \ldots, x_N$ of Lebesgue measure zero; this
exception set may depend on $C$.

It is clear that Theorems 5.6.1 and 5.6.2 hold if we extend the definition of
invariant test to mean that (3) holds except for a fixed set of $x_1, \ldots, x_N$ of
measure 0 (the set not depending on $C$). It has been shown by Hunt and
Stein [Lehmann (1959)] that in our problem almost invariance implies
invariance (in the broad sense).

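Now we wish to argue that if $\psi(\bar{x}, A)$ has power depending only on
$N\mu' \Sigma^{-1} \mu$, it is almost invariant. Since the power of $\psi(\bar{x}, A)$ depends only on
$N\mu' \Sigma^{-1} \mu$, the power is

(4)  $\mathcal{E}_{\mu, \Sigma}\, \psi(\bar{x}, A) = \mathcal{E}_{C\mu,\, C\Sigma C'}\, \psi(\bar{x}, A) = \mathcal{E}_{\mu, \Sigma}\, \psi(C\bar{x}, CAC').$

The second and third terms of (4) are merely different ways of writing the
same integral. Thus

(5)  $\mathcal{E}_{\mu, \Sigma}[\psi(\bar{x}, A) - \psi(C\bar{x}, CAC')] \equiv 0,$

identically in $\mu$, $\Sigma$. Since $\bar{x}$, $A$ are a complete sufficient set of statistics for
$\mu$, $\Sigma$ (Theorem 3.4.2), $f(\bar{x}, A) = \psi(\bar{x}, A) - \psi(C\bar{x}, CAC') = 0$ almost
everywhere. Theorem 5.6.3 follows. ■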

As Theorem 5.6.2 follows from Theorem 5.6.1, so does the following
theorem from Theorem 5.6.3:

Theorem 5.6.4. On the basis of observations $x_1, \ldots, x_N$ from $N(\mu, \Sigma)$, of
all tests of $\mu = 0$ with power depending only on $N\mu' \Sigma^{-1} \mu$, the $T^2$-test is a
uniformly most powerful test.

Theorem 5.6.4 was first proved by Simaika (1941). The results and proofs
given in this section follow Lehmann (1959). Hsu (1945) has proved an optimal
property of the $T^2$-test that involves averaging the power over $\mu$ and $\Sigma$.

5.6.2. Admissible Tests

We now turn to the question of whether the $T^2$-test is a good test compared
to all possible tests; the comparison in the previous section was to the
restricted class of invariant tests. The main result is that the $T^2$-test is
admissible in the class of all tests; that is, there is no other procedure that is
better.

Definition 5.6.3. A test $T^*$ of the null hypothesis $H_0: \omega \in \Omega_0$ against the
alternative $\omega \in \Omega_1$ (disjoint from $\Omega_0$) is admissible if there exists no other test $T$
such that

(6)  $\Pr\{\text{Reject } H_0 \mid T, \omega\} \le \Pr\{\text{Reject } H_0 \mid T^*, \omega\}, \qquad \omega \in \Omega_0,$

(7)  $\Pr\{\text{Reject } H_0 \mid T, \omega\} \ge \Pr\{\text{Reject } H_0 \mid T^*, \omega\}, \qquad \omega \in \Omega_1,$

with strict inequality for at least one $\omega$.

The admissibility of the $T^2$-test follows from a theorem of Stein (1956a)
that applies to any exponential family of distributions.

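An exponential family of distributions $(\mathcal{Y}, \mathcal{B}, m, \Omega, P)$ consists of a
finite-dimensional Euclidean space $\mathcal{Y}$, a measure $m$ on the $\sigma$-algebra $\mathcal{B}$ of all
ordinary Borel sets of $\mathcal{Y}$, a subset $\Omega$ of the adjoint space $\mathcal{Y}'$ (the linear
space of all real-valued linear functions on $\mathcal{Y}$) such that

(8)  $\psi(\omega) = \int_{\mathcal{Y}} e^{\omega' y}\, dm(y) < \infty, \qquad \omega \in \Omega,$

and $P$, the function on $\Omega$ to the set of probability measures on $\mathcal{B}$ given by

$P_\omega(A) = \frac{1}{\psi(\omega)} \int_A e^{\omega' y}\, dm(y), \qquad A \in \mathcal{B}.$

The family of normal distributions $N(\mu, \Sigma)$ constitutes an exponential
family, for the density can be written

(9)  $n(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2}\mu' \Sigma^{-1} \mu} \exp\left( \mu' \Sigma^{-1} x - \tfrac{1}{2} \operatorname{tr} \Sigma^{-1} x x' \right).$

We map from $\mathcal{X}$ to $\mathcal{Y}$; the vector $y = (y^{(1)\prime}, y^{(2)\prime})'$ is composed of $y^{(1)} = x$
and $y^{(2)} = (x_1^2, 2x_1 x_2, \ldots, 2x_1 x_p, x_2^2, \ldots, x_p^2)'$. The vector $\omega = (\omega^{(1)\prime}, \omega^{(2)\prime})'$ is
composed of $\omega^{(1)} = \Sigma^{-1} \mu$ and $\omega^{(2)} = -\frac{1}{2}(\sigma^{11}, \sigma^{12}, \ldots, \sigma^{1p}, \sigma^{22}, \ldots, \sigma^{pp})'$,
where $(\sigma^{ij}) = \Sigma^{-1}$; the transformation of parameters is one to one. The
measure $m(A)$ of a set $A \in \mathcal{B}$ is the ordinary Lebesgue measure of the set of
$x$ that maps into the set $A$. (Note that the probability measure in $\mathcal{Y}$ is not
defined by a density.)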


Theorem 5.6.5 (Stein). Let $(\mathcal{Y}, \mathcal{B}, m, \Omega, P)$ be an exponential family
and $\Omega_0$ a nonempty proper subset of $\Omega$. (i) Let $A$ be a subset of $\mathcal{Y}$ that is closed
and convex. (ii) Suppose that for every vector $\omega \in \mathcal{Y}'$ and real $c$ for which
$\{y \mid \omega' y > c\}$ and $A$ are disjoint, there exists $\omega_1 \in \Omega$ such that for arbitrarily large
$\lambda$ the vector $\omega_1 + \lambda\omega \in \Omega - \Omega_0$. Then the test with acceptance region $A$ is
admissible for testing the hypothesis that $\omega \in \Omega_0$ against the alternative $\omega \in \Omega - \Omega_0$.

The conditions of the theorem are illustrated in Figure 5.2, which is drawn
simultaneously in the space $\mathcal{Y}$ and the set $\Omega$.

Proof. The critical function of the test with acceptance region $A$ is
$\phi_A(y) = 0$, $y \in A$, and $\phi_A(y) = 1$, $y \notin A$. Suppose $\phi(y)$ is the critical function
of a better test, that is,

(10)  $\int \phi(y)\, dP_\omega(y) \le \int \phi_A(y)\, dP_\omega(y), \qquad \omega \in \Omega_0,$

(11)  $\int \phi(y)\, dP_\omega(y) \ge \int \phi_A(y)\, dP_\omega(y), \qquad \omega \in \Omega - \Omega_0,$

with strict inequality for some $\omega$; we shall show that this assumption leads to
a contradiction. Let $B = \{y \mid \phi(y) < 1\}$. (If the competing test is nonrandomized,
$B$ is its acceptance region.) Then

(12)  $\{y \mid \phi_A(y) - \phi(y) > 0\} = \bar{A} \cap B,$

where $\bar{A}$ is the complement of $A$. The $m$-measure of the set (12) is positive;
otherwise $\phi_A(y) = \phi(y)$ almost everywhere, and (10) and (11) would hold
with equality for all $\omega$. Since $A$ is convex, there exists an $\omega$ and a $c$ such that
the intersection of $\bar{A} \cap B$ and $\{y \mid \omega' y > c\}$ has positive $m$-measure. (Since $A$
is closed, $\bar{A}$ is open and it can be covered with a denumerable collection of
open spheres, for example, with rational radii and centers with rational
coordinates. Because there is a hyperplane separating $A$ and each sphere,
there exists a denumerable collection of open half-spaces $H_j$ disjoint from $A$
that covers $\bar{A}$. Then at least one half-space has an intersection with $\bar{A} \cap B$
with positive $m$-measure.) By hypothesis there exists $\omega_1 \in \Omega$ and an arbitrarily
large $\lambda$ such that

(13)  $\omega_\lambda = \omega_1 + \lambda\omega \in \Omega - \Omega_0.$

Then

(14)  $\int [\phi_A(y) - \phi(y)]\, dP_{\omega_\lambda}(y)$
      $= \frac{1}{\psi(\omega_\lambda)} \int [\phi_A(y) - \phi(y)]\, e^{\omega_\lambda' y}\, dm(y)$
      $= \frac{\psi(\omega_1)}{\psi(\omega_\lambda)}\, e^{\lambda c} \left\{ \int_{\omega' y > c} [\phi_A(y) - \phi(y)]\, e^{\lambda(\omega' y - c)}\, dP_{\omega_1}(y) + \int_{\omega' y \le c} [\phi_A(y) - \phi(y)]\, e^{\lambda(\omega' y - c)}\, dP_{\omega_1}(y) \right\}.$

For $\omega' y > c$ we have $\phi_A(y) = 1$ and $\phi_A(y) - \phi(y) \ge 0$, and $\{y \mid \phi_A(y) - \phi(y)
> 0\}$ has positive measure; therefore, the first integral in the braces
approaches $\infty$ as $\lambda \to \infty$. The second integral is bounded because the integrand
is bounded by 1, and hence the last expression is positive for sufficiently large
$\lambda$. This contradicts (11). ■

This proof was given by Stein (1956a). It is a generalization of a theorem
of Birnbaum (1955).

Corollary 5.6.2. If the conditions of Theorem 5.6.5 hold except that $A$ is
not necessarily closed, but the boundary of $A$ has $m$-measure 0, then the
conclusion of Theorem 5.6.5 holds.

Proof. The closure of $A$ is convex (Problem 5.18), and the test with
acceptance region equal to the closure of $A$ differs from $A$ by a set of
probability 0 for all $\omega \in \Omega$. Furthermore,

(15)  $A \cap \{y \mid \omega' y > c\} = \emptyset \;\Rightarrow\; A \subset \{y \mid \omega' y \le c\} \;\Rightarrow\; \text{closure } A \subset \{y \mid \omega' y \le c\}.$

Then Theorem 5.6.5 holds with $A$ replaced by the closure of $A$. ■

Theorem 5.6.6. Based on observations $x_1, \ldots, x_N$ from $N(\mu, \Sigma)$,
Hotelling's $T^2$-test is admissible for testing the hypothesis $\mu = 0$.

Proof. To apply Theorem 5.6.5 we put the distribution of the observations
into the form of an exponential family. By Theorems 3.3.1 and 3.3.2 we can
transform $x_1, \ldots, x_N$ to $z_\alpha = \sum_{\beta=1}^{N} c_{\alpha\beta} x_\beta$, where $(c_{\alpha\beta})$ is orthogonal and $z_N
= \sqrt{N}\bar{x}$. Then the density of $z_1, \ldots, z_N$ (with respect to Lebesgue measure) is

(16)  $(2\pi)^{-pN/2} |\Sigma|^{-N/2} \exp\left[ -\tfrac{1}{2} N\mu' \Sigma^{-1} \mu + \sqrt{N}\mu' \Sigma^{-1} z_N + \operatorname{tr}\left( -\tfrac{1}{2}\Sigma^{-1} \right) \sum_{\alpha=1}^{N} z_\alpha z_\alpha' \right].$

The vector $y = (y^{(1)\prime}, y^{(2)\prime})'$ is composed of $y^{(1)} = z_N$ $(= \sqrt{N}\bar{x})$ and $y^{(2)} =
(b_{11}, 2b_{12}, \ldots, 2b_{1p}, b_{22}, \ldots, b_{pp})'$, where

(17)  $B = \sum_{\alpha=1}^{N} z_\alpha z_\alpha' = \sum_{\alpha=1}^{N} x_\alpha x_\alpha'.$

The vector $\omega = (\omega^{(1)\prime}, \omega^{(2)\prime})'$ is composed of $\omega^{(1)} = \sqrt{N}\Sigma^{-1}\mu$ and $\omega^{(2)} =
-\frac{1}{2}(\sigma^{11}, \sigma^{12}, \ldots, \sigma^{1p}, \sigma^{22}, \ldots, \sigma^{pp})'$. The measure $m(A)$ is the Lebesgue
measure of the set of $z_1, \ldots, z_N$ that maps into the set $A$.

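Lemma 5.6.1. Let $B = A + N\bar{x}\bar{x}'$. Then

(18)  $N\bar{x}' A^{-1} \bar{x} = \frac{N\bar{x}' B^{-1} \bar{x}}{1 - N\bar{x}' B^{-1} \bar{x}}.$

Proof of Lemma. If we let $B = A + \sqrt{N}\bar{x}\, \sqrt{N}\bar{x}'$ in (10) of Section 5.2, we
obtain by Corollary A.3.1

(19)  $\frac{1}{1 + T^2/(N - 1)} = \frac{|A|}{|A + N\bar{x}\bar{x}'|} = \frac{|B - \sqrt{N}\bar{x}\, \sqrt{N}\bar{x}'|}{|B|} = 1 - N\bar{x}' B^{-1} \bar{x}.$ ■

Thus the acceptance region of a $T^2$-test is

(20)  $A = \{z_N, B \mid z_N' B^{-1} z_N \le k,\; B \text{ positive definite}\}$

for a suitable $k$.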
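Lemma 5.6.1 can be checked numerically on any data set for which $A$ is nonsingular. The sketch below uses made-up illustrative observations and hypothetical variable names; it also confirms that $B = A + N\bar{x}\bar{x}'$ equals $\sum x_\alpha x_\alpha'$.

```python
import numpy as np

# Hypothetical data: N = 5 observations on p = 2 components.
x = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 2.5], [1.5, 1.0], [2.5, 3.0]])
n_obs = x.shape[0]
xbar = x.mean(axis=0)

a_mat = (x - xbar).T @ (x - xbar)                # A: sums of products about the mean
b_mat = a_mat + n_obs * np.outer(xbar, xbar)     # B = A + N xbar xbar'

lhs = n_obs * xbar @ np.linalg.solve(a_mat, xbar)   # N xbar' A^{-1} xbar = T^2/(N-1)
q = n_obs * xbar @ np.linalg.solve(b_mat, xbar)     # N xbar' B^{-1} xbar
rhs = q / (1.0 - q)

print(lhs, rhs)
```

Because $q/(1 - q)$ is increasing in $q$ on $[0, 1)$, a bound on $T^2$ is equivalent to a bound on $z_N' B^{-1} z_N$, which is the form used in (20).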

The function $z_N' B^{-1} z_N$ is convex in $(z, B)$ for $B$ positive definite (Problem
5.17). Therefore, the set $z_N' B^{-1} z_N \le k$ is convex. This shows that the set $A$ is
convex. Furthermore, the closure of $A$ is convex (Problem 5.18), and the
probability of the boundary of $A$ is 0.

Now consider the other condition of Theorem 5.6.5. Suppose $A$ is disjoint
with the half-space

(21)  $c < \omega' y = \nu' z_N - \tfrac{1}{2} \operatorname{tr} \Lambda B,$

where $\Lambda$ is a symmetric matrix and $B$ is positive semidefinite. We shall take
$\Lambda_1 = I$. We want to show that $\omega_1 + \lambda\omega \in \Omega - \Omega_0$; that is, that $\nu_1 + \lambda\nu \ne 0$
(which is trivial) and $\Lambda_1 + \lambda\Lambda$ is positive definite for $\lambda > 0$. This is the case
when $\Lambda$ is positive semidefinite. Now we shall show that a half-space (21)
disjoint with $A$ and $\Lambda$ not positive semidefinite implies a contradiction. If $\Lambda$
is not positive semidefinite, it can be written (by Corollary A.4.1 of the
Appendix)

(22)  $\Lambda = D \begin{pmatrix} I & 0 & 0 \\ 0 & -I & 0 \\ 0 & 0 & 0 \end{pmatrix} D',$

where $D$ is nonsingular. If $\Lambda$ is not positive semidefinite, $-I$ is not vacuous,
because its order is the number of negative characteristic roots of $\Lambda$. Let
$z_N = (1/\gamma) z_0$ and

(23)  $B = (D')^{-1} \begin{pmatrix} I & 0 & 0 \\ 0 & \gamma I & 0 \\ 0 & 0 & I \end{pmatrix} D^{-1}.$

Then

(24)  $\omega' y = \frac{1}{\gamma} \nu' z_0 + \tfrac{1}{2} \operatorname{tr} \begin{pmatrix} -I & 0 & 0 \\ 0 & \gamma I & 0 \\ 0 & 0 & 0 \end{pmatrix},$

which is greater than $c$ for sufficiently large $\gamma$. On the other hand

(25)  $z_N' B^{-1} z_N = \frac{1}{\gamma^2}\, z_0' D \begin{pmatrix} I & 0 & 0 \\ 0 & \gamma^{-1} I & 0 \\ 0 & 0 & I \end{pmatrix} D' z_0,$

which is less than $k$ for sufficiently large $\gamma$. This contradicts the fact that (20)
and (21) are disjoint. Thus the conditions of Theorem 5.6.5 are satisfied and
the theorem is proved. ■

This proof is due to Stein.

An alternative proof of admissibility is to show that the $T^2$-test is a proper
Bayes procedure. Suppose an arbitrary random vector $X$ has density $f(x \mid \omega)$
for $\omega \in \Omega$. Consider testing the null hypothesis $H_0: \omega \in \Omega_0$ against the
alternative $H_1: \omega \in \Omega_1 = \Omega - \Omega_0$. Let $\Pi_0$ be a prior finite measure on $\Omega_0$, and $\Pi_1$
a prior finite measure on $\Omega_1$. Then the Bayes procedure (with 0-1 loss
function) is to reject $H_0$ if

(26)  $\frac{\int f(x \mid \omega)\, \Pi_1(d\omega)}{\int f(x \mid \omega)\, \Pi_0(d\omega)} \ge c$

for some $c$ ($0 \le c \le \infty$). If equality in (26) occurs with probability 0 for all
$\omega \in \Omega_0$, then the Bayes procedure is unique and hence admissible. Since the
measures are finite, they can be normed to be probability measures. For the
$T^2$-test of $H_0: \mu = 0$ a pair of measures is suggested in Problem 5.15. (This
pair is not unique.) The reader can verify that with these measures (26)
reduces to the complement of (20).

Among invariant tests it was shown that the $T^2$-test is uniformly most
powerful; that is, it is most powerful against every value of $\mu' \Sigma^{-1} \mu$ among
invariant tests of the specified significance level. We can ask whether the
$T^2$-test is "best" against a specified value of $\mu' \Sigma^{-1} \mu$ among all tests. Here
"best" can be taken to mean admissible minimax; and "minimax" means
maximizing with respect to procedures the minimum with respect to
parameter values of the power. This property was shown in the simplest case of
$p = 2$ and $N = 3$ by Giri, Kiefer, and Stein (1963). The property for general $p$
and $N$ was announced by Salaevskii (1968). He has furnished a proof for the
case of $p = 2$ [Salaevskii (1971)], but has not given a proof for $p > 2$.

Giri and Kiefer (1964) have proved the $T^2$-test is locally minimax (as
$\mu' \Sigma^{-1} \mu \to 0$) and asymptotically (logarithmically) minimax as $\mu' \Sigma^{-1} \mu \to \infty$.

5.7. ELLIPTICALLY CONTOURED DISTRIBUTIONS

5.7.1. Observations Elliptically Contoured

When $x_1, \ldots, x_N$ constitute a sample of $N$ from

(1)  $|\Lambda|^{-1/2}\, g[(x - \nu)' \Lambda^{-1} (x - \nu)],$

the sample mean $\bar{x}$ and covariance $S$ are unbiased estimators of the
distribution mean $\mu = \nu$ and covariance matrix $\Sigma = (\mathcal{E}R^2/p)\Lambda$, where $R^2 =
(X - \nu)' \Lambda^{-1} (X - \nu)$ has finite expectation. The $T^2$-statistic, $T^2 = N(\bar{x} -
\mu)' S^{-1} (\bar{x} - \mu)$, can be used for tests and confidence regions for $\mu$ when $\Sigma$
(or $\Lambda$) is unknown, but the small-sample distribution of $T^2$ in general is
difficult to obtain. However, the limiting distribution of $T^2$ when $N \to \infty$ is
obtained from the facts that $\sqrt{N}(\bar{x} - \mu) \xrightarrow{d} N(0, \Sigma)$ and $S \xrightarrow{p} \Sigma$ (Theorem
3.6.2).

Theorem 5.7.1. Let $x_1, \ldots, x_N$ be a sample from (1). Assume $\mathcal{E}R^2 < \infty$.
Then $T^2 \xrightarrow{d} \chi_p^2$.

Proof. Theorem 3.6.2 implies that $N(\bar{x} - \mu)' \Sigma^{-1} (\bar{x} - \mu) \xrightarrow{d} \chi_p^2$ and $N(\bar{x}
- \mu)' S^{-1} (\bar{x} - \mu) - N(\bar{x} - \mu)' \Sigma^{-1} (\bar{x} - \mu) \xrightarrow{p} 0$. ■

Theorem 5.7.1 implies that the procedures in Section 5.3 can be done on
an asymptotic basis for elliptically contoured distributions. For example, to
test the null hypothesis $\mu = \mu_0$, reject the null hypothesis if

(2)  $N(\bar{x} - \mu_0)' S^{-1} (\bar{x} - \mu_0) > \chi_p^2(\alpha),$

where $\chi_p^2(\alpha)$ is the $\alpha$-significance point of the $\chi^2$-distribution with $p$ degrees
of freedom; the limiting probability of (2) when the null hypothesis is true
and $N \to \infty$ is $\alpha$. Similarly the confidence region $N(\bar{x} - m)' S^{-1} (\bar{x} - m) \le
\chi_p^2(\alpha)$ has limiting confidence $1 - \alpha$.
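The asymptotic test (2) amounts to a few lines of linear algebra. The sketch below is illustrative only: the function name `asymptotic_t2_test`, the data, and the small dictionary of upper 5% points of $\chi^2$ (standard table values, supplied here rather than computed) are assumptions, and the chi-square approximation is justified only for large $N$, which the tiny example does not have.

```python
import numpy as np

# Upper 5% points of chi-square for p = 1..4 (standard table values, not computed here).
CHI2_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

def asymptotic_t2_test(x, mu0, crit):
    """Return (T^2, reject) for the large-sample test of mu = mu0."""
    x = np.asarray(x, dtype=float)
    n_obs = x.shape[0]
    dev = x.mean(axis=0) - np.asarray(mu0, dtype=float)
    s = np.cov(x, rowvar=False)
    t2 = float(n_obs * dev @ np.linalg.solve(s, dev))
    return t2, bool(t2 > crit)

# Hypothetical data with p = 2 (far too few rows for asymptotics; illustration only).
x = np.array([[0.2, 1.1], [1.4, 0.3], [-0.5, 0.8], [0.9, 1.9], [0.1, -0.4], [1.0, 0.6]])
t2_null, reject_null = asymptotic_t2_test(x, x.mean(axis=0), CHI2_05[2])
t2_far, reject_far = asymptotic_t2_test(x, x.mean(axis=0) + 100.0, CHI2_05[2])
print(t2_null, reject_null, t2_far, reject_far)
```

The same quadratic form, evaluated at a candidate $m$ and compared with the critical value, decides whether $m$ lies in the limiting confidence region.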

5.7.2. Elliptically Contoured Matrix Distributions

Let $X$ ($N \times p$) have the density

(3)  $|C|^{-N}\, g[C^{-1}(X - \varepsilon_N \nu')'(X - \varepsilon_N \nu')(C')^{-1}]$

based on the left spherical density $g(Y'Y)$. Here $Y$ has the representation
$Y \stackrel{d}{=} UR'$, where $U$ ($N \times p$) has the uniform distribution on $O(N \times p)$, $R$ is
lower triangular, and $U$ and $R$ are independent. Then $X \stackrel{d}{=} \varepsilon_N \nu' + UR'C'$.
The $T^2$-criterion to test the hypothesis $\nu = 0$ is $N\bar{x}' S^{-1} \bar{x}$, which is invariant
with respect to transformations $X \to XG$. By Corollary 4.5.5 we obtain the
following theorem.

Theorem 5.7.2. Suppose $X$ has the density (3) with $\nu = 0$ and $T^2 =
N\bar{x}' S^{-1} \bar{x}$. Then $[T^2/(N - 1)][(N - p)/p]$ has the distribution of $F_{p,N-p} =
(\chi_p^2/p)/[\chi_{N-p}^2/(N - p)]$.

Thus the tests of hypotheses and construction of confidence regions at
stated significance and confidence levels are valid for left spherical
distributions.

The $T^2$-criterion for $H: \nu = 0$ is

(4)  $T^2 = N\bar{x}' S^{-1} \bar{x} \stackrel{d}{=} N\bar{u}' S_u^{-1} \bar{u},$

since $X \stackrel{d}{=} UR'C'$,

(5)  $\bar{x} = \frac{1}{N} X' \varepsilon_N \stackrel{d}{=} \frac{1}{N} CRU' \varepsilon_N = CR\bar{u},$

and

(6)  $S = \frac{1}{N - 1}(X'X - N\bar{x}\bar{x}') \stackrel{d}{=} \frac{1}{N - 1}[CRU'UR'C' - N\, CR\bar{u}\bar{u}'(CR)']$
     $= CR S_u (CR)'.$
5.7.3. Linear Combinations

Läuter, Glimm, and Kropf (1996a, 1996b, 1996c) have observed that a
statistician can use $X'X \stackrel{d}{=} CRR'C'$ when $\nu = 0$ to determine
a $p \times q$ matrix $D$ and base a $T^2$-test on the transform $Z = XD$.
Specifically, define

(7) $\bar{z}' = \frac{1}{N}\varepsilon_N'Z = \bar{x}'D$,

(8) $S_z = \frac{1}{N-1}(Z'Z - N\bar{z}\bar{z}') = D'SD$,

(9) $T_z^2 = N\bar{z}'S_z^{-1}\bar{z}$.

Since $Q_N Z \stackrel{d}{=} Q_N UR'C'D \stackrel{d}{=} UR'C'D = Z$, the
matrix $Z$ is based on the left-spherical $YD$ and hence has the
representation $Z = VR^{*\prime}$, where $V$ ($N \times q$) has the uniform
distribution on $O(N \times q)$, independent of $R^{*\prime}$ (upper
triangular) having the distribution derived from $R^*R^{*\prime} = Z'Z$. The
distribution of $T_z^2/(N - 1)$ is $F_{q,N-q}\,q/(N - q)$.

The matrix $D$ can also involve prior information as well as knowledge of
$X'X$. If $p$ is large, $q$ can be small; the test based on $T_z^2$ may be
more powerful than a test based on $T^2$.

Läuter, Glimm, and Kropf give several examples of choosing $D$. One of
them is to choose $D$ ($p \times 1$) as
$[\mathrm{Diag}(X'X)]^{-1/2}\varepsilon_p$, where $\mathrm{Diag}\,A$ is a
diagonal matrix with $i$th diagonal element $a_{ii}$. The statistic $T_z^2$
is called the standardized sum statistic.
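
A minimal sketch of the standardized sum statistic, assuming the choice
$D = [\mathrm{Diag}(X'X)]^{-1/2}\varepsilon_p$ described above (so $q = 1$).
The function name is ours and is not taken from Läuter, Glimm, and Kropf.
Because each column of $X$ is divided by the square root of its own sum of
squares, the statistic does not change when a column is rescaled.

```python
import numpy as np

def standardized_sum_t2(x):
    """T_z^2 = N zbar^2 / s_z^2 with z = X D, D = Diag(X'X)^{-1/2} eps_p."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    d = 1.0 / np.sqrt(np.sum(x * x, axis=0))   # diagonal of X'X to the power -1/2
    z = x @ d                                   # Z = XD, an N-vector since q = 1
    zbar = z.mean()
    s_z = z.var(ddof=1)                         # scalar S_z with divisor N - 1
    return float(n * zbar ** 2 / s_z)

x_demo = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
t2_z = standardized_sum_t2(x_demo)
```

Under $\nu = 0$ the text gives $T_z^2/(N - 1)$ the distribution of
$F_{1,N-1}/(N - 1)$, so $T_z^2$ itself would be referred to $F_{1,N-1}$.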

PROBLEMS 

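5.1. (Sec. 5.2) Let $x_\alpha$ be distributed according to
$N(\mu + \beta(z_\alpha - \bar{z}), \Sigma)$, $\alpha = 1, \ldots, N$, where
$\bar{z} = (1/N)\sum z_\alpha$. Let
$b = [1/\sum(z_\alpha - \bar{z})^2]\sum x_\alpha(z_\alpha - \bar{z})$,
$(N - 2)S = \sum[x_\alpha - \bar{x} - b(z_\alpha - \bar{z})][x_\alpha - \bar{x} - b(z_\alpha - \bar{z})]'$,
and $T^2 = \sum(z_\alpha - \bar{z})^2\,b'S^{-1}b$. Show that $T^2$ has the
$T^2$-distribution with $N - 2$ degrees of freedom. [Hint: See Problem 3.13.]

5.2. (Sec. 5.2.2) Show that $T^2/(N - 1)$ can be written as $R^2/(1 - R^2)$
with the correspondences given in Table 5.1.

Table 5.1

Section 5.2                       Section 4.4

$z_{0\alpha} = 1/\sqrt{N}$        $z_{1\alpha}$
$x_\alpha$                        $z_\alpha^{(2)}$
$\sqrt{N}\bar{x}$                 $a_{(1)} = \sum z_{1\alpha}z_\alpha^{(2)}$
$B = \sum x_\alpha x_\alpha'$     $A_{22} = \sum z_\alpha^{(2)}z_\alpha^{(2)\prime}$
$1 = \sum z_{0\alpha}^2$          $a_{11} = \sum z_{1\alpha}^2$
$T^2/(N - 1)$                     $R^2/(1 - R^2)$
$p$                               $p - 1$
$N$                               $n$
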

5.3. (Sec. 5.2.2) Let

$\frac{R^2}{1 - R^2} = \frac{\sum u_\alpha x_\alpha'(\sum x_\alpha x_\alpha')^{-1}\sum u_\alpha x_\alpha}{\sum u_\alpha^2 - \sum u_\alpha x_\alpha'(\sum x_\alpha x_\alpha')^{-1}\sum u_\alpha x_\alpha}$,

where $u_1, \ldots, u_N$ are $N$ numbers and $x_1, \ldots, x_N$ are
independent, each with the distribution $N(0, \Sigma)$. Prove that the
distribution of $R^2/(1 - R^2)$ is independent of $u_1, \ldots, u_N$. [Hint:
There is an orthogonal $N \times N$ matrix $C$ that carries
$(u_1, \ldots, u_N)$ into a vector proportional to
$(1/\sqrt{N}, \ldots, 1/\sqrt{N})$.]

5.4. (Sec. 5.2.2) Use Problems 5.2 and 5.3 to show that
$[T^2/(N - 1)][(N - p)/p]$ has the $F_{p,N-p}$-distribution (under the null
hypothesis). [Note: This is the analysis that corresponds to Hotelling's
geometric proof (1931).]

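5.5. (Sec. 5.2.2) Let $T^2 = N\bar{x}'S^{-1}\bar{x}$, where $\bar{x}$ and $S$
are the mean vector and covariance matrix of a sample of $N$ from
$N(\mu, \Sigma)$. Show that $T^2$ is distributed the same when $\mu$ is
replaced by $\lambda = (\tau, 0, \ldots, 0)'$, where
$\tau^2 = \mu'\Sigma^{-1}\mu$, and $\Sigma$ is replaced by $I$.

5.6. (Sec. 5.2.2) Let $u = [T^2/(N - 1)]/[1 + T^2/(N - 1)]$. Show that
$u = \gamma V'(VV')^{-1}V\gamma'$, where
$\gamma = (1/\sqrt{N}, \ldots, 1/\sqrt{N})$ and

$V = \begin{pmatrix} v_1 \\ \vdots \\ v_p \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1N} \\ \vdots & & \vdots \\ x_{p1} & \cdots & x_{pN} \end{pmatrix}$.

5.7. (Sec. 5.2.2) Let

$v_1^* = v_1, \quad v_i^* = v_i - \frac{v_iv_1'}{v_1v_1'}v_1, \quad i = 2, \ldots, p, \quad \gamma^* = \gamma - \frac{\gamma v_1'}{v_1v_1'}v_1$,

and let $V^*$ be the matrix with rows $v_1^*, \ldots, v_p^*$. Prove that
$u = s + (1 - s)w$, where

$s = \frac{(\gamma v_1')^2}{v_1v_1'}, \quad w = \frac{\gamma^*V^{*\prime}(V^*V^{*\prime})^{-1}V^*\gamma^{*\prime}}{\gamma^*\gamma^{*\prime}}$.

[Hint: $EV = V^*$, where

$E = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ -\frac{v_2v_1'}{v_1v_1'} & 1 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ -\frac{v_pv_1'}{v_1v_1'} & 0 & \cdots & 1 \end{pmatrix}$.]

5.8. (Sec. 5.2.2) Prove that $w$ has the distribution of the square of a
multiple correlation between one vector and $p - 1$ vectors in
$(N - 1)$-space without subtracting means; that is, it has density

$\frac{\Gamma[\frac{1}{2}(N - 1)]}{\Gamma[\frac{1}{2}(p - 1)]\,\Gamma[\frac{1}{2}(N - p)]}\,w^{\frac{1}{2}(p-1)-1}(1 - w)^{\frac{1}{2}(N-p)-1}$.

[Hint: The transformation of Problem 5.7 is a projection of
$v_2, \ldots, v_p, \gamma$ on the $(N - 1)$-space orthogonal to $v_1$.]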

5.9. (Sec. 5.2.2) Verify that $r = s/(1 - s)$ multiplied by $(N - 1)/1$ has
the noncentral $F$-distribution with 1 and $N - 1$ degrees of freedom and
noncentrality parameter $N\tau^2$.

5.10. (Sec. 5.2.2) From Problems 5.5-5.9, verify Corollary 5.2.1.

5.11. (Sec. 5.3) Use the data in Section 3.2 to test the hypothesis that
neither drug has a soporific effect at significance level 0.01.

5.12. (Sec. 5.3) Using the data in Section 3.2, give a confidence region for
$\mu$ with confidence coefficient 0.95.

5.13. (Sec. 5.3) Prove the statement in Section 5.3.6 that the
$T^2$-statistic is independent of the choice of $C$.

5.14. (Sec. 5.5) Use the data of Problem 4.41 to test the hypothesis that the
mean head length and breadth of first sons are equal to those of second sons
at significance level 0.01.

5.15. (Sec. 5.6.2) $T^2$-test as a Bayes procedure [Kiefer and Schwartz
(1965)]. Let $x_1, \ldots, x_N$ be independently distributed, each according
to $N(\mu, \Sigma)$. Let $\Pi_0$ be defined by
$[\mu, \Sigma] = [0, (I + \eta\eta')^{-1}]$ with $\eta$ having a density
proportional to $|I + \eta\eta'|^{-\frac{1}{2}N}$, and let $\Pi_1$ be defined
by $[\mu, \Sigma] = [(I + \eta\eta')^{-1}\eta, (I + \eta\eta')^{-1}]$ with
$\eta$ having a density proportional to

$|I + \eta\eta'|^{-\frac{1}{2}N}\exp[\frac{1}{2}N\eta'(I + \eta\eta')^{-1}\eta]$.

(a) Show that the measures are finite for $N > p$ by showing
$\eta'(I + \eta\eta')^{-1}\eta \le 1$ and verifying that the integral of
$|I + \eta\eta'|^{-\frac{1}{2}N} = (1 + \eta'\eta)^{-\frac{1}{2}N}$ is finite.

(b) Show that the inequality (26) is equivalent to
$N\bar{x}'(\sum_{\alpha=1}^N x_\alpha x_\alpha')^{-1}\bar{x} \ge k$.
Hence the $T^2$-test is Bayes and thus admissible.

5.16. (Sec. 5.6.2) Let $g(t) = f[ty_1 + (1 - t)y_2]$, where $f(y)$ is a
real-valued function of the vector $y$. Prove that if $g(t)$ is convex, then
$f(y)$ is convex.

5.17. (Sec. 5.6.2) Show that $z'B^{-1}z$ is a convex function of $(z, B)$,
where $B$ is a positive definite matrix. [Hint: Use Problem 5.16.]

5.18. (Sec. 5.6.2) Prove that if the set $A$ is convex, then the closure of
$A$ is convex.

5.19. (Sec. 5.3) Let $\bar{x}$ and $S$ be based on $N$ observations from
$N(\mu, \Sigma)$, and let $x$ be an additional observation from
$N(\mu, \Sigma)$. Show that $x - \bar{x}$ is distributed according to

$N[0, (1 + 1/N)\Sigma]$.

Verify that $[N/(N + 1)](x - \bar{x})'S^{-1}(x - \bar{x})$ has the
$T^2$-distribution with $N - 1$ degrees of freedom. Show how this statistic
can be used to give a prediction region for $x$ based on $\bar{x}$ and $S$
(i.e., a region such that one has a given confidence that the next
observation will fall into it).





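5.20. (Sec. 5.3) Let $x_\alpha^{(i)}$ be observations from
$N(\mu^{(i)}, \Sigma_i)$, $\alpha = 1, \ldots, N_i$, $i = 1, 2$. Find the
likelihood ratio criterion for testing the hypothesis
$\mu^{(1)} = \mu^{(2)}$.

5.21. (Sec. 5.4) Prove that $\mu'\Sigma^{-1}\mu$ is larger for
$\mu' = (\mu_1, \mu_2)$ than for $\mu = \mu_1$ by verifying

$(\mu_1\ \mu_2)\begin{pmatrix} \sigma_1^2 & \sigma_1\sigma_2\rho \\ \sigma_1\sigma_2\rho & \sigma_2^2 \end{pmatrix}^{-1}\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} = \frac{\mu_1^2}{\sigma_1^2} + \frac{(\mu_2 - \rho\sigma_2\mu_1/\sigma_1)^2}{\sigma_2^2(1 - \rho^2)}$.

Discuss the power of the test $\mu_1 = 0$ compared to the power of the test
$\mu_1 = 0$, $\mu_2 = 0$.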

5.22. (Sec. 5.3)

(a) Using the data of Section 5.3.4, test the hypothesis ^ = ^{\ 

(b) Test the hypothesis ^ 

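5.23. (Sec. 5.4) Let

$\mu = \begin{pmatrix} \mu^{(1)} \\ \mu^{(2)} \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$.

Prove $\mu'\Sigma^{-1}\mu \ge \mu^{(1)\prime}\Sigma_{11}^{-1}\mu^{(1)}$. Give a
condition for strict inequality to hold. [Hint: This is the vector analog of
Problem 5.21.]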

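5.24. Let $X^{(i)\prime} = (Y^{(i)\prime}, Z^{(i)\prime})$, $i = 1, 2$, where
$Y^{(i)}$ has $p$ components and $Z^{(i)}$ has $q$ components, be distributed
according to $N(\mu^{(i)}, \Sigma)$, where

$\mu^{(i)} = \begin{pmatrix} \mu_y^{(i)} \\ \mu_z^{(i)} \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \Sigma_{yy} & \Sigma_{yz} \\ \Sigma_{zy} & \Sigma_{zz} \end{pmatrix}, \quad i = 1, 2$.

Find the likelihood ratio criterion (or equivalent $T^2$-criterion) for
testing $\mu_z^{(1)} = \mu_z^{(2)}$ given $\mu_y^{(1)} = \mu_y^{(2)}$ on the
basis of a sample of $N_i$ on $X^{(i)}$, $i = 1, 2$. [Hint: Express the
likelihood in terms of the marginal density of $Y^{(i)}$ and the conditional
density of $Z^{(i)}$ given $Y^{(i)}$.]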

5.25. Find the distribution of the criterion in the preceding problem under the null 
hypothesis. 

5.26. (Sec. 5.5) Suppose $x_\alpha^{(g)}$ is an observation from
$N(\mu^{(g)}, \Sigma_g)$, $\alpha = 1, \ldots, N_g$, $g = 1, \ldots, q$.

(a) Show that the hypothesis $\mu^{(1)} = \cdots = \mu^{(q)}$ is equivalent
to $\mathcal{E}y_\alpha^{(i)} = 0$, $i = 1, \ldots, q - 1$, where


= aV^" + L 


g 奴 2 


%, 


r(^> - 


+ E4 S ) + — 

’( N { N g Y 


…， / V !， i = 1, - 1; 

<N gt 豸 = 2,."，and (a^\..., a^ 0 ), i= l”..，q — 1， are linearly inde¬ 
pendent. 

(b) Show how to construct a $T^2$-test of the hypothesis using
$(y^{(1)\prime}, \ldots, y^{(q-1)\prime})'$ yielding an $F$-statistic with
$(q - 1)p$ and $N - (q - 1)p$ degrees of freedom [Anderson (1963b)].


5.27. (Sec. 5.2) Prove (25) is the density of
$V = \chi_a^2/(\chi_a^2 + \chi_b^2)$. [Hint: In the joint density of
$U = \chi_a^2$ and $W = \chi_b^2$ make the transformation
$u = vw(1 - v)^{-1}$, $w = w$ and integrate out $w$.]



CHAPTER 6 


Classification of Observations 


6.1. THE PROBLEM OF CLASSIFICATION


The problem of classification arises when an investigator makes a number of
measurements on an individual and wishes to classify the individual into one
of several categories on the basis of these measurements. The investigator
cannot identify the individual with a category directly but must use these
measurements. In many cases it can be assumed that there are a finite number
of categories or populations from which the individual may have come and
each population is characterized by a probability distribution of the
measurements. Thus an individual is considered as a random observation from
this population. The question is: Given an individual with certain
measurements, from which population did the person arise?

The problem of classification may be considered as a problem of “statistical
decision functions.” We have a number of hypotheses: Each hypothesis is
that the distribution of the observation is a given one. We must accept one of
these hypotheses and reject the others. If only two populations are admitted,
we have an elementary problem of testing one hypothesis of a specified
distribution against another.

In some instances, the categories are specified beforehand in the sense
that the probability distributions of the measurements are assumed
completely known. In other cases, the form of each distribution may be known,
but the parameters of the distribution must be estimated from a sample from
that population.

Let us give an example of a problem of classification. Prospective students
applying for admission into college are given a battery of tests; the vector
of scores is a set of measurements $x$. The prospective student may be a
member of one population consisting of those students who will successfully
complete college training or, rather, have potentialities for successfully
completing training, or the student may be a member of the other population,
those who will not complete the college course successfully. The problem is
to classify a student applying for admission on the basis of his scores on
the entrance examination.

In this chapter we shall develop the theory of classification in general
terms and then apply it to cases involving the normal distribution. In Section
6.2 the problem of classification with two populations is defined in terms of
decision theory, and in Section 6.3 Bayes and admissible solutions are
obtained. In Section 6.4 the theory is applied to two known normal
populations, differing with respect to means, yielding the population linear
discriminant function. When the parameters are unknown, they are replaced by
estimates (Section 6.5). An alternative procedure is maximum likelihood. In
Section 6.6 the probabilities of misclassification by the two methods are
evaluated in terms of asymptotic expansions of the distributions. Then these
developments are carried out for several populations. Finally, in Section
6.10 linear procedures for the two populations are studied when the
covariance matrices are different and the parameters are known.


6.2. STANDARDS OF GOOD CLASSIFICATION 
6.2.1. Preliminary Considerations

In constructing a procedure of classification, it is desired to minimize the 
probability of misclassification, or, more specifically, it is desired to minimize 
on the average the bad effects of misclassification. Now let us make this 
notion precise. For convenience we shall now consider the case of only two 
categories. Later we shall treat the more general case. This section develops 
the ideas of Section 3.4 in more detail for the problem of two decisions. 

Suppose an individual is an observation from either population $\pi_1$ or
population $\pi_2$. The classification of an observation depends on the
vector of measurements $x' = (x_1, \ldots, x_p)$ on that individual. We set
up a rule that if an individual is characterized by certain sets of values of
$x_1, \ldots, x_p$ that person will be classified as from $\pi_1$, if other
values, as from $\pi_2$.

We can think of an observation as a point in a $p$-dimensional space. We
divide this space into two regions. If the observation falls in $R_1$, we
classify it as coming from population $\pi_1$, and if it falls in $R_2$ we
classify it as coming from population $\pi_2$.

In following a given classification procedure, the statistician can make two
kinds of errors in classification. If the individual is actually from
$\pi_1$, the statistician can classify him or her as coming from population
$\pi_2$; if from $\pi_2$, the statistician can classify him or her as from
$\pi_1$. We need to know the relative undesirability of these two kinds of
misclassification. Let the cost of the first type of misclassification be
$C(2|1)$ ($> 0$), and let the cost of misclassifying an individual from
$\pi_2$ as from $\pi_1$ be $C(1|2)$ ($> 0$). These costs may be measured in
any kind of units. As we shall see later, it is only the ratio of the two
costs that is important. The statistician may not know these costs in each
case, but will often have at least a rough idea of them.

Table 6.1 indicates the costs of correct and incorrect classification.
Clearly, a good classification procedure is one that minimizes in some sense
or other the cost of misclassification.


6.2.2. Two Cases of Two Populations 

We shall consider ways of defining “minimum cost” in two cases. In one case
we shall suppose that we have a priori probabilities of the two populations.
Let the probability that an observation comes from population $\pi_1$ be
$q_1$ and from population $\pi_2$ be $q_2$ ($q_1 + q_2 = 1$). The probability
properties of population $\pi_i$ are specified by a distribution function.
For convenience we shall treat only the case where the distribution has a
density, although the case of discrete probabilities lends itself to almost
the same treatment. Let the density of population $\pi_1$ be $p_1(x)$ and
that of $\pi_2$ be $p_2(x)$. If we have a region $R_1$ of classification as
from $\pi_1$, the probability of correctly classifying an observation that
actually is drawn from population $\pi_1$ is

(1) $P(1|1, R) = \int_{R_1} p_1(x)\,dx$,

where $dx = dx_1 \cdots dx_p$, and the probability of misclassification of an
observation from $\pi_1$ is

(2) $P(2|1, R) = \int_{R_2} p_1(x)\,dx$.

Similarly, the probability of correctly classifying an observation from
$\pi_2$ is

(3) $P(2|2, R) = \int_{R_2} p_2(x)\,dx$,

and the probability of misclassifying such an observation is

(4) $P(1|2, R) = \int_{R_1} p_2(x)\,dx$.

Since the probability of drawing an observation from $\pi_1$ is $q_1$, the
probability of drawing an observation from $\pi_1$ and correctly classifying
it is $q_1P(1|1, R)$; that is, this is the probability of the situation in
the upper left-hand corner of Table 6.1. Similarly, the probability of
drawing an observation from $\pi_1$ and misclassifying it is $q_1P(2|1, R)$.
The probability associated with the lower left-hand corner of Table 6.1 is
$q_2P(1|2, R)$, and with the lower right-hand corner is $q_2P(2|2, R)$.

What is the average or expected loss from costs of misclassification? It is
the sum of the products of costs of misclassifications with their respective
probabilities of occurrence:

(5) $C(2|1)P(2|1, R)q_1 + C(1|2)P(1|2, R)q_2$.

It is this average loss that we wish to minimize. That is, we want to divide
our space into regions $R_1$ and $R_2$ such that the expected loss is as
small as possible. A procedure that minimizes (5) for given $q_1$ and $q_2$
is called a Bayes procedure.
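
To make (5) concrete, the following sketch evaluates the expected loss of a
one-dimensional threshold rule. Everything specific here is an assumption
made for illustration: $\pi_1$ is taken to be $N(1, 1)$, $\pi_2$ to be
$N(-1, 1)$, the regions are $R_1 = \{x \ge t\}$ and $R_2 = \{x < t\}$, and
the function names are ours.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_loss(t, q1, cost21, cost12, mu1=1.0, mu2=-1.0):
    """Expected loss (5) for R1 = {x >= t}, R2 = {x < t}.

    pi_1 is N(mu1, 1) and pi_2 is N(mu2, 1): hypothetical densities.
    cost21 stands for C(2|1) and cost12 for C(1|2).
    """
    q2 = 1.0 - q1
    p21 = normal_cdf(t - mu1)          # P(2|1, R): x from pi_1 falls in R2
    p12 = 1.0 - normal_cdf(t - mu2)    # P(1|2, R): x from pi_2 falls in R1
    return cost21 * p21 * q1 + cost12 * p12 * q2

loss_at_zero = expected_loss(0.0, q1=0.5, cost21=1.0, cost12=1.0)
loss_at_half = expected_loss(0.5, q1=0.5, cost21=1.0, cost12=1.0)
```

With equal priors and equal costs the threshold $t = 0$ gives a smaller
expected loss than $t = 0.5$, as the symmetry of the two densities suggests.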

In the example of admission of students, the undesirability of
misclassification is, in one instance, the expense of teaching a student who
will not complete the course successfully and is, in the other instance, the
undesirability of excluding from college a potentially good student.

The other case we shall treat is that in which there are no known a priori
probabilities. In this case the expected loss if the observation is from
$\pi_1$ is

(6) $C(2|1)P(2|1, R) = r(1, R)$;

the expected loss if the observation is from $\pi_2$ is

(7) $C(1|2)P(1|2, R) = r(2, R)$.

We do not know whether the observation is from $\pi_1$ or from $\pi_2$, and
we do not know probabilities of these two instances.

A procedure $R$ is at least as good as a procedure $R^*$ if
$r(1, R) \le r(1, R^*)$ and $r(2, R) \le r(2, R^*)$; $R$ is better than $R^*$
if at least one of these inequalities is a strict inequality. Usually there
is no one procedure that is better than all other procedures or is at least
as good as all other procedures. A procedure $R$ is called admissible if
there is no procedure better than $R$; we shall be interested in the entire
class of admissible procedures. It will be shown that under certain
conditions this class is the same as the class of Bayes procedures. A class
of procedures is complete if for every procedure outside the class there is
one in the class which is better; a class is called essentially complete if
for every procedure outside the class there is one in the class which is at
least as good. A minimal complete class (if it exists) is a complete class
such that no proper subset is a complete class; a similar definition holds
for a minimal essentially complete class. Under certain conditions we shall
show that the admissible class is minimal complete. To simplify the
discussion we shall consider procedures the same if they only differ on sets
of probability zero. In fact, throughout the next section we shall make
statements which are meant to hold except for sets of probability zero
without saying so explicitly.

A principle that usually leads to a unique procedure is the minimax
principle. A procedure is minimax if the maximum expected loss, $r(i, R)$, is
a minimum. From a conservative point of view, this may be considered an
optimum procedure. For a general discussion of the concepts in this section
and the next see Wald (1950), Blackwell and Girshick (1954), Ferguson
(1967), DeGroot (1970), and Berger (1980b).
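
The definitions of "at least as good," "better," "admissible," and
"minimax" can be illustrated on a finite list of procedures, each summarized
by its pair of expected losses $(r(1, R), r(2, R))$. This is only a sketch:
the function names and the four risk pairs are hypothetical, and
admissibility is judged within the listed procedures only.

```python
def at_least_as_good(r, r_star):
    """r and r_star are pairs (r(1, R), r(2, R)) of expected losses."""
    return r[0] <= r_star[0] and r[1] <= r_star[1]

def better(r, r_star):
    """At least as good, with at least one strict inequality."""
    return at_least_as_good(r, r_star) and (r[0] < r_star[0] or r[1] < r_star[1])

def admissible_among(risks):
    """Procedures in the list that no other listed procedure is better than."""
    return [r for r in risks if not any(better(s, r) for s in risks)]

def minimax_among(risks):
    """Procedure in the list whose maximum expected loss is smallest."""
    return min(risks, key=max)

# Hypothetical risk pairs for four procedures.
risks = [(0.1, 0.4), (0.2, 0.2), (0.3, 0.3), (0.4, 0.1)]
admissible = admissible_among(risks)
minimax = minimax_among(risks)
```

Here the pair (0.3, 0.3) is dominated by (0.2, 0.2), so it is the only
inadmissible entry, and (0.2, 0.2) is the minimax choice.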

6.3. PROCEDURES OF CLASSIFICATION INTO ONE OF TWO
POPULATIONS WITH KNOWN PROBABILITY DISTRIBUTIONS


6.3.1. The Case When A Priori Probabilities Are Known 

We now turn to the problem of choosing regions $R_1$ and $R_2$ so as to
minimize (5) of Section 6.2. Since we have a priori probabilities, we can
define joint probabilities of the population and the observed set of
variables. The probability that an observation comes from $\pi_1$ and that
each variate is less than the corresponding component in $y$ is

(1) $\int_{-\infty}^{y_1} \cdots \int_{-\infty}^{y_p} q_1p_1(x)\,dx_1 \cdots dx_p$.

We can also define the conditional probability that an observation came from
a certain population given the values of the observed variates. For instance,
the conditional probability of coming from population $\pi_1$, given an
observation $x$, is

(2) $\frac{q_1p_1(x)}{q_1p_1(x) + q_2p_2(x)}$.

Suppose for a moment that $C(1|2) = C(2|1) = 1$. Then the expected loss is

(3) $q_1\int_{R_2} p_1(x)\,dx + q_2\int_{R_1} p_2(x)\,dx$.

This is also the probability of a misclassification; hence we wish to
minimize the probability of misclassification.

For a given observed point $x$ we minimize the probability of a
misclassification by assigning the population that has the higher conditional
probability. If

(4) $\frac{q_1p_1(x)}{q_1p_1(x) + q_2p_2(x)} \ge \frac{q_2p_2(x)}{q_1p_1(x) + q_2p_2(x)}$,

we choose population $\pi_1$. Otherwise we choose population $\pi_2$. Since
we minimize the probability of misclassification at each point, we minimize
it over the whole space. Thus the rule is

(5) $R_1$: $q_1p_1(x) \ge q_2p_2(x)$,
    $R_2$: $q_1p_1(x) < q_2p_2(x)$.

If $q_1p_1(x) = q_2p_2(x)$, the point could be classified as either from
$\pi_1$ or $\pi_2$; we have arbitrarily put it into $R_1$. If
$q_1p_1(x) + q_2p_2(x) = 0$ for a given $x$, that point also may go into
either region.

Now let us prove formally that (5) is the best procedure. For any procedure
$R^* = (R_1^*, R_2^*)$, the probability of misclassification is

(6) $q_1\int_{R_2^*} p_1(x)\,dx + q_2\int_{R_1^*} p_2(x)\,dx$
      $= \int_{R_2^*} [q_1p_1(x) - q_2p_2(x)]\,dx + q_2\int p_2(x)\,dx$.

On the right-hand side the second term is a given number; the first term is
minimized if $R_2^*$ includes the points $x$ such that
$q_1p_1(x) - q_2p_2(x) < 0$ and excludes the points for which
$q_1p_1(x) - q_2p_2(x) > 0$. If

(7) $\Pr\left\{\frac{p_1(x)}{p_2(x)} = \frac{q_2}{q_1} \,\Big|\, \pi_i\right\} = 0$, $i = 1, 2$,

then the Bayes procedure is unique except for sets of probability zero.

Now we notice that mathematically the problem was: given nonnegative
constants $q_1$ and $q_2$ and nonnegative functions $p_1(x)$ and $p_2(x)$,
choose regions $R_1$ and $R_2$ so as to minimize (3). The solution is (5). If
we wish to minimize (5) of Section 6.2, which can be written

(8) $[C(2|1)q_1]\int_{R_2} p_1(x)\,dx + [C(1|2)q_2]\int_{R_1} p_2(x)\,dx$,

we choose $R_1$ and $R_2$ according to

(9) $R_1$: $[C(2|1)q_1]p_1(x) \ge [C(1|2)q_2]p_2(x)$,
    $R_2$: $[C(2|1)q_1]p_1(x) < [C(1|2)q_2]p_2(x)$,

since $C(2|1)q_1$ and $C(1|2)q_2$ are nonnegative constants. Another way of
writing (9) is

(10) $R_1$: $\frac{p_1(x)}{p_2(x)} \ge \frac{C(1|2)q_2}{C(2|1)q_1}$,
     $R_2$: $\frac{p_1(x)}{p_2(x)} < \frac{C(1|2)q_2}{C(2|1)q_1}$.


Theorem 6.3.1. If $q_1$ and $q_2$ are a priori probabilities of drawing an
observation from population $\pi_1$ with density $p_1(x)$ and $\pi_2$ with
density $p_2(x)$, respectively, and if the cost of misclassifying an
observation from $\pi_1$ as from $\pi_2$ is $C(2|1)$ and an observation from
$\pi_2$ as from $\pi_1$ is $C(1|2)$, then the regions of classification $R_1$
and $R_2$, defined by (10), minimize the expected cost. If

(11) $\Pr\left\{\frac{p_1(x)}{p_2(x)} = \frac{q_2C(1|2)}{q_1C(2|1)} \,\Big|\, \pi_i\right\} = 0$, $i = 1, 2$,

then the procedure is unique except for sets of probability zero.
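
The rule of Theorem 6.3.1 amounts to a single comparison at each point $x$.
The sketch below uses the product form (9) rather than the ratio form (10),
so that no division by $p_2(x)$ is needed. The function name and the numbers
in the example calls are ours.

```python
def bayes_region(p1_x, p2_x, q1, q2, cost21, cost12):
    """Return 1 if x belongs to R1 and 2 if x belongs to R2, following (9).

    p1_x and p2_x are the two densities evaluated at the observed x;
    cost21 stands for C(2|1) and cost12 for C(1|2). Ties go to R1.
    """
    return 1 if cost21 * q1 * p1_x >= cost12 * q2 * p2_x else 2

# Hypothetical density values at some observed point x.
equal_costs = bayes_region(0.3, 0.2, q1=0.5, q2=0.5, cost21=1.0, cost12=1.0)
costly_12 = bayes_region(0.3, 0.2, q1=0.5, q2=0.5, cost21=1.0, cost12=2.0)
```

Doubling $C(1|2)$ moves this point from $R_1$ to $R_2$: only the ratio
$C(1|2)q_2/[C(2|1)q_1]$ matters.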


6.3.2. The Case When No Set of A Priori Probabilities Is Known

In many instances of classification the statistician cannot assign a priori
probabilities to the two populations. In this case we shall look for the
class of admissible procedures, that is, the set of procedures that cannot be
improved upon.

First, let us prove that a Bayes procedure is admissible. Let
$R = (R_1, R_2)$ be a Bayes procedure for a given $q_1$, $q_2$; is there a
procedure $R^* = (R_1^*, R_2^*)$ such that $P(1|2, R^*) \le P(1|2, R)$ and
$P(2|1, R^*) \le P(2|1, R)$ with at least one strict inequality? Since $R$ is
a Bayes procedure,

(12) $q_1P(2|1, R) + q_2P(1|2, R) \le q_1P(2|1, R^*) + q_2P(1|2, R^*)$.

This inequality can be written

(13) $q_1[P(2|1, R) - P(2|1, R^*)] \le q_2[P(1|2, R^*) - P(1|2, R)]$.

Suppose $0 < q_1 < 1$. Then if $P(1|2, R^*) < P(1|2, R)$, the right-hand side
of (13) is less than zero and therefore $P(2|1, R) < P(2|1, R^*)$. Then
$P(2|1, R^*) < P(2|1, R)$ similarly implies $P(1|2, R) < P(1|2, R^*)$. Thus
$R^*$ is not better than $R$, and $R$ is admissible. If $q_1 = 0$, then (13)
implies $0 \le P(1|2, R^*) - P(1|2, R)$. For a Bayes procedure, $R_1$
includes only points for which $p_2(x) = 0$. Therefore, $P(1|2, R) = 0$ and
if $R^*$ is to be better $P(1|2, R^*) = 0$. If
$\Pr\{p_2(x) = 0 \mid \pi_1\} = 0$, then
$P(2|1, R) = \Pr\{p_2(x) > 0 \mid \pi_1\} = 1$. If $P(1|2, R^*) = 0$, then
$R_1^*$ contains only points for which $p_2(x) = 0$. Then
$P(2|1, R^*) = \Pr\{R_2^* \mid \pi_1\} = \Pr\{p_2(x) > 0 \mid \pi_1\} = 1$,
and $R^*$ is not better than $R$.

Theorem 6.3.2. If $\Pr\{p_2(x) = 0 \mid \pi_1\} = 0$ and
$\Pr\{p_1(x) = 0 \mid \pi_2\} = 0$, then every Bayes procedure is admissible.

Now let us prove the converse, namely, that every admissible procedure is
a Bayes procedure. We assume¹

(14) $\Pr\left\{\frac{p_1(x)}{p_2(x)} = k \,\Big|\, \pi_i\right\} = 0$, $i = 1, 2$, $0 \le k \le \infty$.

Then for any $q_1$ the Bayes procedure is unique. Moreover, the cdf of
$p_1(x)/p_2(x)$ for $\pi_1$ and $\pi_2$ is continuous.

Let $R$ be an admissible procedure. Then there exists a $k$ such that

(15) $P(2|1, R) = \Pr\left\{\frac{p_1(x)}{p_2(x)} < k \,\Big|\, \pi_1\right\} = P(2|1, R^*)$,

where $R^*$ is the Bayes procedure corresponding to $q_2/q_1 = k$ [i.e.,
$q_1 = 1/(1 + k)$]. Since $R$ is admissible, $P(1|2, R) \le P(1|2, R^*)$.
However, since by Theorem 6.3.2 $R^*$ is admissible,
$P(1|2, R) \ge P(1|2, R^*)$; that is, $P(1|2, R) = P(1|2, R^*)$. Therefore,
$R$ is also a Bayes procedure; by the uniqueness of Bayes procedures $R$ is
the same as $R^*$.

Theorem 6.3.3. If (14) holds, then every admissible procedure is a Bayes
procedure.

The proof of Theorem 6.3.3 shows that the class of Bayes procedures is
complete. For if $R$ is any procedure outside the class, we construct a Bayes
procedure $R^*$ so that $P(2|1, R) = P(2|1, R^*)$. Then, since $R^*$ is
admissible, $P(1|2, R) \ge P(1|2, R^*)$. Furthermore, the class of Bayes
procedures is minimal complete since it is identical with the class of
admissible procedures.

¹ $p_1(x)/p_2(x) = \infty$ means $p_2(x) = 0$.

Theorem 6.3.4. If (14) holds, the class of Bayes procedures is minimal
complete.

Finally, let us consider the minimax procedure. Let
$P(i|j, q_1) = P(i|j, R)$, where $R$ is the Bayes procedure corresponding to
$q_1$. $P(i|j, q_1)$ is a continuous function of $q_1$. $P(2|1, q_1)$ varies
from 1 to 0 as $q_1$ goes from 0 to 1; $P(1|2, q_1)$ varies from 0 to 1. Thus
there is a value of $q_1$, say $q_1^*$, such that
$P(2|1, q_1^*) = P(1|2, q_1^*)$. This is the minimax solution, for if there
were another procedure $R^*$ such that
$\max\{P(2|1, R^*), P(1|2, R^*)\} < P(2|1, q_1^*) = P(1|2, q_1^*)$, that
would contradict the fact that every Bayes solution is admissible.
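
The search for $q_1^*$ can be sketched by bisection, using the monotonicity
just described. The two densities are hypothetical choices made so that both
error probabilities have closed forms: on $x > 0$, $p_1(x) = e^{-x}$ and
$p_2(x) = 2e^{-2x}$, with equal costs. Then $p_1(x)/p_2(x) = e^x/2$, and the
Bayes region $R_1$ is $\{x \ge t\}$ with $t = \max(\log 2k, 0)$,
$k = q_2/q_1$. The function names are ours.

```python
import math

def bayes_error_probs(q1):
    """P(2|1, q1) and P(1|2, q1) for the Bayes rule with prior q1, equal costs.

    Hypothetical densities on x > 0: p1(x) = exp(-x), p2(x) = 2 exp(-2x).
    """
    k = (1.0 - q1) / q1
    t = max(math.log(2.0 * k), 0.0)   # R1 = {x >= t}
    p21 = 1.0 - math.exp(-t)          # Pr{x < t | pi_1}
    p12 = math.exp(-2.0 * t)          # Pr{x >= t | pi_2}
    return p21, p12

def minimax_prior(tol=1e-12):
    """Bisection for q1* with P(2|1, q1*) = P(1|2, q1*)."""
    lo, hi = 1e-9, 1.0 - 1e-9         # P(2|1) - P(1|2) is > 0 at lo, < 0 at hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        p21, p12 = bayes_error_probs(mid)
        if p21 > p12:
            lo = mid                  # raising q1 lowers P(2|1), raises P(1|2)
        else:
            hi = mid
    return 0.5 * (lo + hi)

q1_star = minimax_prior()
p21_star, p12_star = bayes_error_probs(q1_star)
```

For these densities the common error probability solves $1 - a = a^2$ with
$a = e^{-t}$, which gives about 0.382 at $q_1^* \approx 0.553$.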


6.4. CLASSIFICATION INTO ONE OF TWO KNOWN MULTIVARIATE 
NORMAL POPULATIONS 


Now we shall use the general procedure outlined above in the case of two
multivariate normal populations with equal covariance matrices, namely,
$N(\mu^{(1)}, \Sigma)$ and $N(\mu^{(2)}, \Sigma)$, where
$\mu^{(i)\prime} = (\mu_1^{(i)}, \ldots, \mu_p^{(i)})$ is the vector of means
of the $i$th population, $i = 1, 2$, and $\Sigma$ is the matrix of variances
and covariances of each population. [The approach was first used by Wald
(1944).] Then the $i$th density is

(1) $p_i(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp[-\frac{1}{2}(x - \mu^{(i)})'\Sigma^{-1}(x - \mu^{(i)})]$.

The ratio of densities is

(2) $\frac{p_1(x)}{p_2(x)} = \frac{\exp[-\frac{1}{2}(x - \mu^{(1)})'\Sigma^{-1}(x - \mu^{(1)})]}{\exp[-\frac{1}{2}(x - \mu^{(2)})'\Sigma^{-1}(x - \mu^{(2)})]}$
      $= \exp\{-\frac{1}{2}[(x - \mu^{(1)})'\Sigma^{-1}(x - \mu^{(1)}) - (x - \mu^{(2)})'\Sigma^{-1}(x - \mu^{(2)})]\}$.

The region of classification into $\pi_1$, $R_1$, is the set of $x$'s for
which (2) is greater than or equal to $k$ (for $k$ suitably chosen). Since
the logarithmic function is monotonically increasing, the inequality can be
written in terms of the logarithm of (2) as

(3) $-\frac{1}{2}[(x - \mu^{(1)})'\Sigma^{-1}(x - \mu^{(1)}) - (x - \mu^{(2)})'\Sigma^{-1}(x - \mu^{(2)})] \ge \log k$.

The left-hand side of (3) can be expanded as

(4) $-\frac{1}{2}[x'\Sigma^{-1}x - x'\Sigma^{-1}\mu^{(1)} - \mu^{(1)\prime}\Sigma^{-1}x + \mu^{(1)\prime}\Sigma^{-1}\mu^{(1)}$
      $- x'\Sigma^{-1}x + x'\Sigma^{-1}\mu^{(2)} + \mu^{(2)\prime}\Sigma^{-1}x - \mu^{(2)\prime}\Sigma^{-1}\mu^{(2)}]$.

By rearrangement of the terms we obtain

(5) $x'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) - \frac{1}{2}(\mu^{(1)} + \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})$.

The first term is the well-known discriminant function. It is a function of
the components of the observation vector.

The following theorem is now a direct consequence of Theorem 6.3.1.

Theorem 6.4.1. If $\pi_i$ has the density (1), $i = 1, 2$, the best regions
of classification are given by

(6) $R_1$: $x'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) - \frac{1}{2}(\mu^{(1)} + \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) \ge \log k$,
    $R_2$: $x'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) - \frac{1}{2}(\mu^{(1)} + \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) < \log k$.

If a priori probabilities $q_1$ and $q_2$ are known, then $k$ is given by

(7) $k = \frac{q_2C(1|2)}{q_1C(2|1)}$.

In the particular case of the two populations being equally likely and the
costs being equal, $k = 1$ and $\log k = 0$. Then the region of
classification into $\pi_1$ is

(8) $R_1$: $x'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) \ge \frac{1}{2}(\mu^{(1)} + \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})$.
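
The regions of Theorem 6.4.1 can be computed without forming
$\Sigma^{-1}$ explicitly, by solving one linear system for
$\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})$. The following is a sketch only; the
function names and the numerical parameters are assumptions made for
illustration.

```python
import numpy as np

def discriminant_u(x, mu1, mu2, sigma):
    """Left-hand side of (6): x'd - (1/2)(mu1 + mu2)'d, d = Sigma^{-1}(mu1 - mu2)."""
    x, mu1, mu2 = (np.asarray(v, dtype=float) for v in (x, mu1, mu2))
    d = np.linalg.solve(np.asarray(sigma, dtype=float), mu1 - mu2)
    return float(x @ d - 0.5 * (mu1 + mu2) @ d)

def classify(x, mu1, mu2, sigma, q1=0.5, q2=0.5, cost21=1.0, cost12=1.0):
    """Return 1 for R1 and 2 for R2, with k taken from (7)."""
    log_k = np.log(q2 * cost12 / (q1 * cost21))
    return 1 if discriminant_u(x, mu1, mu2, sigma) >= log_k else 2

# Hypothetical known parameters, p = 2.
mu1 = np.array([1.0, 0.0])
mu2 = np.array([-1.0, 0.0])
sigma = np.array([[1.0, 0.5], [0.5, 1.0]])

label_a = classify([0.5, 0.2], mu1, mu2, sigma)
label_b = classify([0.1, 1.0], mu1, mu2, sigma)
```

The second point lies closer to $\mu^{(1)}$ in Euclidean distance but is
assigned to $\pi_2$, because the positive correlation in $\Sigma$ tilts the
separating hyperplane.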


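If we do not have a priori probabilities, we may select $\log k = c$, say, on
the basis of making the expected losses due to misclassification equal. Let
$X$ be a random observation. Then we wish to find the distribution of

(9) $U = X'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) - \frac{1}{2}(\mu^{(1)} + \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})$

on the assumption that $X$ is distributed according to
$N(\mu^{(1)}, \Sigma)$ and then on the assumption that $X$ is distributed
according to $N(\mu^{(2)}, \Sigma)$. When $X$ is distributed according to
$N(\mu^{(1)}, \Sigma)$, $U$ is normally distributed with mean

(10) $\mathcal{E}_1U = \mu^{(1)\prime}\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) - \frac{1}{2}(\mu^{(1)} + \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})$
       $= \frac{1}{2}(\mu^{(1)} - \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})$

and variance

(11) $\mathrm{Var}_1(U) = \mathcal{E}_1(\mu^{(1)} - \mu^{(2)})'\Sigma^{-1}(X - \mu^{(1)})(X - \mu^{(1)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})$
       $= (\mu^{(1)} - \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})$.

The Mahalanobis squared distance between $N(\mu^{(1)}, \Sigma)$ and
$N(\mu^{(2)}, \Sigma)$ is

(12) $(\mu^{(1)} - \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) = \Delta^2$,

say. Then $U$ is distributed according to
$N(\frac{1}{2}\Delta^2, \Delta^2)$ if $X$ is distributed according to
$N(\mu^{(1)}, \Sigma)$. If $X$ is distributed according to
$N(\mu^{(2)}, \Sigma)$, then

(13) $\mathcal{E}_2U = \mu^{(2)\prime}\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}) - \frac{1}{2}(\mu^{(1)} + \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})$
       $= \frac{1}{2}(\mu^{(2)} - \mu^{(1)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})$
       $= -\frac{1}{2}\Delta^2$.

The variance is the same as when $X$ is distributed according to
$N(\mu^{(1)}, \Sigma)$ because it depends only on the second-order moments of
$X$. Thus $U$ is distributed according to
$N(-\frac{1}{2}\Delta^2, \Delta^2)$.

The probability of misclassification if the observation is from $\pi_1$ is

(14) $P(2|1) = \int_{-\infty}^{c} \frac{1}{\sqrt{2\pi}\Delta}e^{-\frac{1}{2}(z - \frac{1}{2}\Delta^2)^2/\Delta^2}\,dz = \int_{-\infty}^{(c - \frac{1}{2}\Delta^2)/\Delta} \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}y^2}\,dy$,

and the probability of misclassification if the observation is from $\pi_2$
is

(15) $P(1|2) = \int_{c}^{\infty} \frac{1}{\sqrt{2\pi}\Delta}e^{-\frac{1}{2}(z + \frac{1}{2}\Delta^2)^2/\Delta^2}\,dz = \int_{(c + \frac{1}{2}\Delta^2)/\Delta}^{\infty} \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}y^2}\,dy$.

Figure 6.1 indicates the two probabilities as the shaded portions in the
tails.

For the minimax solution we choose $c$ so that

(16) $C(1|2)\int_{(c + \frac{1}{2}\Delta^2)/\Delta}^{\infty} \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}y^2}\,dy = C(2|1)\int_{-\infty}^{(c - \frac{1}{2}\Delta^2)/\Delta} \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}y^2}\,dy$.

Theorem 6.4.2. If the $\pi_i$ have densities (1), $i = 1, 2$, the minimax
regions of classification are given by (6) where $c = \log k$ is chosen by
the condition (16) with $C(i|j)$ the two costs of misclassification.

It should be noted that if the costs of misclassification are equal, $c = 0$
and the probability of misclassification is

(17) $\int_{\Delta/2}^{\infty} \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}y^2}\,dy$.

In case the costs of misclassification are unequal, $c$ could be determined
to sufficient accuracy by a trial-and-error method with the normal tables.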
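
The trial-and-error search for $c$ can be replaced by bisection, since
$P(2|1)$ increases and $P(1|2)$ decreases as $c$ grows. The sketch below
evaluates (14) and (15) through the error function and solves (16); the
function names, the bracketing interval, and the example values
$\Delta = 2$, $C(2|1) = 1$, $C(1|2) = 3$ are assumptions for illustration.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def error_probs(c, delta):
    """(14) and (15): P(2|1) and P(1|2) for cutoff c, Mahalanobis distance delta."""
    p21 = normal_cdf((c - 0.5 * delta ** 2) / delta)
    p12 = 1.0 - normal_cdf((c + 0.5 * delta ** 2) / delta)
    return p21, p12

def minimax_cutoff(delta, cost21, cost12, tol=1e-10):
    """Solve (16), cost12 * P(1|2) = cost21 * P(2|1), for c by bisection."""
    lo = -10.0 * delta ** 2 - 10.0
    hi = 10.0 * delta ** 2 + 10.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        p21, p12 = error_probs(mid, delta)
        if cost21 * p21 < cost12 * p12:
            lo = mid                  # c too small: P(1|2) still weighs more
        else:
            hi = mid
    return 0.5 * (lo + hi)

c_equal = minimax_cutoff(2.0, cost21=1.0, cost12=1.0)
c_unequal = minimax_cutoff(2.0, cost21=1.0, cost12=3.0)
p21_u, p12_u = error_probs(c_unequal, 2.0)
```

With equal costs the solution is $c = 0$, in agreement with (17); making
$C(1|2)$ larger pushes $c$ upward so that fewer observations from $\pi_2$
fall in $R_1$.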

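Both terms in (5) involve the vector

(18) $\delta = \Sigma^{-1}(\mu^{(1)} - \mu^{(2)})$.

This is obtained as the solution of

(19) $\Sigma\delta = \mu^{(1)} - \mu^{(2)}$

by an efficient computing method. The discriminant function $x'\delta$ is the
linear function that maximizes

(20) $\frac{[\mathcal{E}_1(X'd) - \mathcal{E}_2(X'd)]^2}{\mathrm{Var}(X'd)}$

for all choices of $d$. The numerator of (20) is

(21) $[\mu^{(1)\prime}d - \mu^{(2)\prime}d]^2 = d'(\mu^{(1)} - \mu^{(2)})(\mu^{(1)} - \mu^{(2)})'d$;

the denominator is

(22) $d'\mathcal{E}(X - \mathcal{E}X)(X - \mathcal{E}X)'d = d'\Sigma d$.

We wish to maximize (21) with respect to $d$, holding (22) constant. If
$\lambda$ is a Lagrange multiplier, we ask for the maximum of

(23) $d'(\mu^{(1)} - \mu^{(2)})(\mu^{(1)} - \mu^{(2)})'d - \lambda(d'\Sigma d - 1)$.

The derivatives of (23) with respect to the components of $d$ are set equal
to zero to obtain

(24) $2(\mu^{(1)} - \mu^{(2)})(\mu^{(1)} - \mu^{(2)})'d = 2\lambda\Sigma d$.

Since $(\mu^{(1)} - \mu^{(2)})'d$ is a scalar, say $\nu$, we can write (24) as

(25) $\mu^{(1)} - \mu^{(2)} = \frac{\lambda}{\nu}\Sigma d$.

Thus the solution is proportional to $\delta$.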
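
The maximizing property of $\delta$ can be checked numerically: the ratio
(20) evaluated at $d = \delta$ equals
$(\mu^{(1)} - \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})$, and no other
direction should exceed it. The parameter values, the random trial
directions, and the function name below are hypothetical.

```python
import numpy as np

def separation_ratio(d, mu1, mu2, sigma):
    """Ratio (20): squared mean difference of X'd over Var(X'd) = d' Sigma d."""
    diff = mu1 - mu2
    return float((diff @ d) ** 2 / (d @ sigma @ d))

# Hypothetical parameters, p = 3; sigma is positive definite.
mu1 = np.array([1.0, 0.0, 2.0])
mu2 = np.array([0.0, 1.0, 0.0])
sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])

delta = np.linalg.solve(sigma, mu1 - mu2)            # solution of (19)
best = separation_ratio(delta, mu1, mu2, sigma)

rng = np.random.default_rng(1)
others = [separation_ratio(rng.standard_normal(3), mu1, mu2, sigma)
          for _ in range(1000)]
```

Any nonzero multiple of $\delta$ gives the same ratio, which is why (25)
determines the solution only up to proportionality.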

We may finally note that if we have a sample of $N$ from either $\pi_1$ or
$\pi_2$, we use the mean of the sample and classify it as from
$N[\mu^{(1)}, (1/N)\Sigma]$ or $N[\mu^{(2)}, (1/N)\Sigma]$.

6.5. CLASSIFICATION INTO ONE OF TWO MULTIVARIATE NORMAL 
POPULATIONS WHEN THE PARAMETERS ARE ESTIMATED 

6.5.1. The Criterion of Classification

Thus far we have assumed that the two populations are known exactly. In
most applications of this theory the populations are not known, but must be
inferred from samples, one from each population. We shall now treat the
case in which we have a sample from each of two normal populations and we
wish to use that information in classifying another observation as coming
from one of the two populations.

Suppose that we have a sample $x_1^{(1)}, \ldots, x_{N_1}^{(1)}$ from
$N(\mu^{(1)}, \Sigma)$ and a sample $x_1^{(2)}, \ldots, x_{N_2}^{(2)}$ from
$N(\mu^{(2)}, \Sigma)$. In one terminology these are “training samples.”
On the basis of this information we wish to classify the observation $x$ as
coming from $\pi_1$ or $\pi_2$. Clearly, our best estimate of $\mu^{(1)}$ is
$\bar{x}^{(1)} = \sum_{\alpha=1}^{N_1} x_\alpha^{(1)}/N_1$, of $\mu^{(2)}$ is
$\bar{x}^{(2)} = \sum_{\alpha=1}^{N_2} x_\alpha^{(2)}/N_2$, and of $\Sigma$
is $S$ defined by

(1) $(N_1 + N_2 - 2)S = \sum_{\alpha=1}^{N_1} (x_\alpha^{(1)} - \bar{x}^{(1)})(x_\alpha^{(1)} - \bar{x}^{(1)})'$
      $+ \sum_{\alpha=1}^{N_2} (x_\alpha^{(2)} - \bar{x}^{(2)})(x_\alpha^{(2)} - \bar{x}^{(2)})'$.

We substitute these estimates for the parameters in (5) of Section 6.4 to
obtain

(2) $x'S^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)}) - \frac{1}{2}(\bar{x}^{(1)} + \bar{x}^{(2)})'S^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)})$.

The first term of (2) is the discriminant function based on two samples
[suggested by Fisher (1936)]. It is the linear function that has greatest
variance between samples relative to the variance within samples (Problem
6.12). We propose that (2) be used as the criterion of classification in the
same way that (5) of Section 6.4 is used.
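
A minimal sketch of the sample criterion: the pooled covariance matrix $S$
is formed with divisor $N_1 + N_2 - 2$, and the criterion is evaluated as
$[x - \frac{1}{2}(\bar{x}^{(1)} + \bar{x}^{(2)})]'S^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)})$.
The function names and the small training samples are hypothetical.

```python
import numpy as np

def pooled_covariance(x1, x2):
    """S with (N1 + N2 - 2) S = sum of the two within-sample cross-product matrices."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    r1 = x1 - x1.mean(axis=0)
    r2 = x2 - x2.mean(axis=0)
    return (r1.T @ r1 + r2.T @ r2) / (len(x1) + len(x2) - 2)

def sample_criterion(x, x1, x2):
    """[x - (m1 + m2)/2]' S^{-1} (m1 - m2), with m1, m2 the training sample means."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    s = pooled_covariance(x1, x2)
    return float((np.asarray(x, dtype=float) - 0.5 * (m1 + m2))
                 @ np.linalg.solve(s, m1 - m2))

# Hypothetical training samples, p = 2, four observations each.
train1 = np.array([[2.0, 0.0], [3.0, 1.0], [4.0, 0.0], [3.0, -1.0]])
train2 = -train1
w_new = sample_criterion([1.0, 5.0], train1, train2)
```

A positive value of the criterion points toward $\pi_1$ when the cutoff is
taken to be zero; interchanging the two training samples changes only its
sign.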

When the populations are known, we can argue that the classification 
criterion is the best in the sense that its use minimizes the expected loss in 
the case of known a priori probabilities and generates the class of admissible 
procedures when a priori probabilities are not known. We cannot justify the
use of (2) in the same way. However, it seems intuitively reasonable that (2) 
should give good results. Another criterion is indicated in Section 6.5.5. 
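
The plug-in construction of the criterion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`mean`, `scatter`, `solve`, `sample_discriminant`) are hypothetical, the samples are small lists of rows, and the pooled matrix S is assumed nonsingular. The sketch forms the two sample means and the pooled S of (1), and returns the function $W(x) = [x - \tfrac12(\bar{x}^{(1)} + \bar{x}^{(2)})]'S^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)})$.

```python
def mean(rows):
    """Componentwise mean of a list of observation vectors."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def scatter(rows, m):
    """Sum of (x - m)(x - m)' over the rows."""
    p = len(m)
    a = [[0.0] * p for _ in range(p)]
    for r in rows:
        d = [r[j] - m[j] for j in range(p)]
        for i in range(p):
            for j in range(p):
                a[i][j] += d[i] * d[j]
    return a


def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting
    (a is assumed nonsingular)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(m[r][k]))
        m[k], m[piv] = m[piv], m[k]
        for r in range(k + 1, n):
            f = m[r][k] / m[k][k]
            for c in range(k, n + 1):
                m[r][c] -= f * m[k][c]
    z = [0.0] * n
    for k in range(n - 1, -1, -1):
        z[k] = (m[k][n] - sum(m[k][c] * z[c] for c in range(k + 1, n))) / m[k][k]
    return z


def sample_discriminant(sample1, sample2):
    """Return the function W(x) built from two training samples."""
    m1, m2 = mean(sample1), mean(sample2)
    a1, a2 = scatter(sample1, m1), scatter(sample2, m2)
    n = len(sample1) + len(sample2) - 2
    p = len(m1)
    s = [[(a1[i][j] + a2[i][j]) / n for j in range(p)] for i in range(p)]
    coef = solve(s, [m1[j] - m2[j] for j in range(p)])  # S^{-1}(xbar1 - xbar2)
    mid = [(m1[j] + m2[j]) / 2 for j in range(p)]

    def w(x):
        return sum(coef[j] * (x[j] - mid[j]) for j in range(p))

    return w


# Toy data (made up for illustration): classify as pi_1 when W(x) >= 0.
w = sample_discriminant(
    [[2.0, 1.0], [3.0, 2.0], [4.0, 1.0], [3.0, 0.0]],
    [[-1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, -1.0]],
)
```

At the two sample means the function takes the values $\pm\tfrac12 D^2$, where $D^2$ is the sample Mahalanobis squared distance, and it vanishes at their midpoint.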

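Suppose we have a sample $x_1, \ldots, x_N$ from either $\pi_1$ or $\pi_2$, and we wish
to classify the sample as a whole. Then we define S by

(3)  $(N_1 + N_2 + N - 3)S = \sum_{\alpha=1}^{N_1} (x_\alpha^{(1)} - \bar{x}^{(1)})(x_\alpha^{(1)} - \bar{x}^{(1)})' + \sum_{\alpha=1}^{N_2} (x_\alpha^{(2)} - \bar{x}^{(2)})(x_\alpha^{(2)} - \bar{x}^{(2)})' + \sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})',$

where

(4)  $\bar{x} = \frac{1}{N}\sum_{\alpha=1}^{N} x_\alpha.$

Then the criterion is

(5)  $[\bar{x} - \tfrac{1}{2}(\bar{x}^{(1)} + \bar{x}^{(2)})]'S^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)}).$

The larger N is, the smaller are the probabilities of misclassification.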

6.5.2. On the Distribution of the Criterion 
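Let

(6)  $W = X'S^{-1}(\bar{X}^{(1)} - \bar{X}^{(2)}) - \tfrac{1}{2}(\bar{X}^{(1)} + \bar{X}^{(2)})'S^{-1}(\bar{X}^{(1)} - \bar{X}^{(2)})$
       $= [X - \tfrac{1}{2}(\bar{X}^{(1)} + \bar{X}^{(2)})]'S^{-1}(\bar{X}^{(1)} - \bar{X}^{(2)})$

for random $X$, $\bar{X}^{(1)}$, $\bar{X}^{(2)}$, and $S$.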
The distribution of W is extremely complicated. It depends on the sample
sizes and the unknown $\Delta^2$. Let

(7)  $Y_1 = c_1[X - (N_1 + N_2)^{-1}(N_1\bar{X}^{(1)} + N_2\bar{X}^{(2)})],$

(8)  $Y_2 = c_2(\bar{X}^{(1)} - \bar{X}^{(2)}),$

where $c_1 = \sqrt{(N_1 + N_2)/(N_1 + N_2 + 1)}$ and $c_2 = \sqrt{N_1N_2/(N_1 + N_2)}$. Then
$Y_1$ and $Y_2$ are independently normally distributed with covariance matrix $\Sigma$.
The expected value of $Y_2$ is $c_2(\mu^{(1)} - \mu^{(2)})$, and the expected value of $Y_1$ is
$c_1[N_2/(N_1 + N_2)](\mu^{(1)} - \mu^{(2)})$ if X is from $\pi_1$ and
$-c_1[N_1/(N_1 + N_2)](\mu^{(1)} - \mu^{(2)})$ if X is from $\pi_2$. Let $Y = (Y_1\ Y_2)$ and

(9)  $M = Y'S^{-1}Y = \begin{pmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{pmatrix}.$

Then

(10)  $W = \sqrt{\frac{N_1 + N_2 + 1}{N_1N_2}}\, m_{12} + \frac{N_1 - N_2}{2N_1N_2}\, m_{22}.$


The density of M has been given by Sitgreaves (1952). Anderson (1951a) and
Wald (1944) have also studied the distribution of W.

If $N_1 = N_2$, the distribution of W for X from $\pi_1$ is the same as that of
$-W$ for X from $\pi_2$. Thus, if $W \ge 0$ is the region of classification as $\pi_1$, then
the probability of misclassifying X when it is from $\pi_1$ is equal to the
probability of misclassifying it when it is from $\pi_2$.


6.5.3. The Asymptotic Distribution of the Criterion

In the case of large samples from $N(\mu^{(1)}, \Sigma)$ and $N(\mu^{(2)}, \Sigma)$, we can apply
limiting distribution theory. Since $\bar{X}^{(1)}$ is the mean of a sample of $N_1$
independent observations from $N(\mu^{(1)}, \Sigma)$, we know that

(11)  $\operatorname{plim}_{N_1 \to \infty} \bar{X}^{(1)} = \mu^{(1)}.$

The explicit definition of (11) is as follows: Given arbitrary positive $\delta$ and $\varepsilon$,
we can find N large enough so that for $N_1 \ge N$

(12)  $\Pr\{|\bar{X}_i^{(1)} - \mu_i^{(1)}| < \delta,\ i = 1, \ldots, p\} > 1 - \varepsilon.$
(See Problem 3.23.) This can be proved by using the Tchebycheff inequality.
Similarly,

(13)  $\operatorname{plim}_{N_2 \to \infty} \bar{X}^{(2)} = \mu^{(2)},$

and

(14)  $\operatorname{plim} S = \Sigma$

as $N_1 \to \infty$, $N_2 \to \infty$, or as both $N_1, N_2 \to \infty$. From (14) we obtain

(15)  $\operatorname{plim} S^{-1} = \Sigma^{-1},$

since the probability limits of sums, differences, products, and quotients of
random variables are the sums, differences, products, and quotients of their
probability limits as long as the probability limit of each denominator is
different from zero [Cramér (1946), p. 254]. Furthermore,

(16)  $\operatorname{plim} S^{-1}(\bar{X}^{(1)} - \bar{X}^{(2)}) = \Sigma^{-1}(\mu^{(1)} - \mu^{(2)}),$

(17)  $\operatorname{plim}\, (\bar{X}^{(1)} + \bar{X}^{(2)})'S^{-1}(\bar{X}^{(1)} - \bar{X}^{(2)}) = (\mu^{(1)} + \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)}).$

It follows then that the limiting distribution of W is the distribution of U.
For sufficiently large samples from $\pi_1$ and $\pi_2$ we can use the criterion as if
we knew the population exactly and make only a small error. [The result was
first given by Wald (1944).]

Theorem 6.5.1. Let W be given by (6) with $\bar{X}^{(1)}$ the mean of a sample of $N_1$
from $N(\mu^{(1)}, \Sigma)$, $\bar{X}^{(2)}$ the mean of a sample of $N_2$ from $N(\mu^{(2)}, \Sigma)$, and S the
estimate of $\Sigma$ based on the pooled sample. The limiting distribution of W as
$N_1 \to \infty$ and $N_2 \to \infty$ is $N(\tfrac12\Delta^2, \Delta^2)$ if X is distributed according to $N(\mu^{(1)}, \Sigma)$
and is $N(-\tfrac12\Delta^2, \Delta^2)$ if X is distributed according to $N(\mu^{(2)}, \Sigma)$.

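6.5.4. Another Derivation of the Criterion

A convenient mnemonic derivation of the criterion is the use of regression of
a dummy variate [given by Fisher (1936)]. Let

(18)  $y_\alpha^{(1)} = \frac{N_2}{N_1 + N_2},\ \alpha = 1, \ldots, N_1; \qquad y_\alpha^{(2)} = \frac{-N_1}{N_1 + N_2},\ \alpha = 1, \ldots, N_2.$

Then formally find the regression on the variates $x_\alpha^{(i)}$ by choosing b to
minimize

(19)  $\sum_{i=1}^{2}\sum_{\alpha=1}^{N_i} [y_\alpha^{(i)} - b'(x_\alpha^{(i)} - \bar{x})]^2,$

where

(20)  $\bar{x} = \frac{N_1\bar{x}^{(1)} + N_2\bar{x}^{(2)}}{N_1 + N_2}.$

The normal equations are

(21)  $\sum_{i=1}^{2}\sum_{\alpha=1}^{N_i} (x_\alpha^{(i)} - \bar{x})(x_\alpha^{(i)} - \bar{x})'b = \sum_{i=1}^{2}\sum_{\alpha=1}^{N_i} y_\alpha^{(i)}(x_\alpha^{(i)} - \bar{x}) = \frac{N_1N_2}{N_1 + N_2}(\bar{x}^{(1)} - \bar{x}^{(2)}).$

The matrix multiplying b can be written as

(22)  $\sum_{i=1}^{2}\sum_{\alpha=1}^{N_i} (x_\alpha^{(i)} - \bar{x})(x_\alpha^{(i)} - \bar{x})'$
        $= \sum_{i=1}^{2}\sum_{\alpha=1}^{N_i} (x_\alpha^{(i)} - \bar{x}^{(i)})(x_\alpha^{(i)} - \bar{x}^{(i)})' + N_1(\bar{x}^{(1)} - \bar{x})(\bar{x}^{(1)} - \bar{x})' + N_2(\bar{x}^{(2)} - \bar{x})(\bar{x}^{(2)} - \bar{x})'$
        $= \sum_{i=1}^{2}\sum_{\alpha=1}^{N_i} (x_\alpha^{(i)} - \bar{x}^{(i)})(x_\alpha^{(i)} - \bar{x}^{(i)})' + \frac{N_1N_2}{N_1 + N_2}(\bar{x}^{(1)} - \bar{x}^{(2)})(\bar{x}^{(1)} - \bar{x}^{(2)})'.$

Thus (21) can be written as

(23)  $Ab = \frac{N_1N_2}{N_1 + N_2}(\bar{x}^{(1)} - \bar{x}^{(2)})[1 - (\bar{x}^{(1)} - \bar{x}^{(2)})'b],$

where

(24)  $A = \sum_{i=1}^{2}\sum_{\alpha=1}^{N_i} (x_\alpha^{(i)} - \bar{x}^{(i)})(x_\alpha^{(i)} - \bar{x}^{(i)})'.$

Since $(\bar{x}^{(1)} - \bar{x}^{(2)})'b$ is a scalar, we see that the solution b of (23) is
proportional to $S^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)})$.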


6.5.5. The Likelihood Ratio Criterion 

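Another criterion which can be used in classification is the likelihood ratio
criterion. Consider testing the composite null hypothesis that $x, x_1^{(1)}, \ldots, x_{N_1}^{(1)}$
are drawn from $N(\mu^{(1)}, \Sigma)$ and $x_1^{(2)}, \ldots, x_{N_2}^{(2)}$ are drawn from $N(\mu^{(2)}, \Sigma)$
against the composite alternative hypothesis that $x_1^{(1)}, \ldots, x_{N_1}^{(1)}$ are drawn from
$N(\mu^{(1)}, \Sigma)$ and $x, x_1^{(2)}, \ldots, x_{N_2}^{(2)}$ are drawn from $N(\mu^{(2)}, \Sigma)$, with $\mu^{(1)}$, $\mu^{(2)}$, and
$\Sigma$ unspecified. Under the first hypothesis the maximum likelihood estimators
of $\mu^{(1)}$, $\mu^{(2)}$, and $\Sigma$ are

(25)  $\hat\mu_1^{(1)} = \frac{N_1\bar{x}^{(1)} + x}{N_1 + 1}, \qquad \hat\mu_1^{(2)} = \bar{x}^{(2)},$

      $\hat\Sigma_1 = \frac{1}{N_1 + N_2 + 1}\Big[\sum_{\alpha=1}^{N_1} (x_\alpha^{(1)} - \hat\mu_1^{(1)})(x_\alpha^{(1)} - \hat\mu_1^{(1)})' + (x - \hat\mu_1^{(1)})(x - \hat\mu_1^{(1)})' + \sum_{\alpha=1}^{N_2} (x_\alpha^{(2)} - \hat\mu_1^{(2)})(x_\alpha^{(2)} - \hat\mu_1^{(2)})'\Big].$

Since

(26)  $\sum_{\alpha=1}^{N_1} (x_\alpha^{(1)} - \hat\mu_1^{(1)})(x_\alpha^{(1)} - \hat\mu_1^{(1)})' + (x - \hat\mu_1^{(1)})(x - \hat\mu_1^{(1)})'$
        $= \sum_{\alpha=1}^{N_1} (x_\alpha^{(1)} - \bar{x}^{(1)})(x_\alpha^{(1)} - \bar{x}^{(1)})' + N_1(\bar{x}^{(1)} - \hat\mu_1^{(1)})(\bar{x}^{(1)} - \hat\mu_1^{(1)})' + (x - \hat\mu_1^{(1)})(x - \hat\mu_1^{(1)})'$
        $= \sum_{\alpha=1}^{N_1} (x_\alpha^{(1)} - \bar{x}^{(1)})(x_\alpha^{(1)} - \bar{x}^{(1)})' + \frac{N_1}{N_1 + 1}(x - \bar{x}^{(1)})(x - \bar{x}^{(1)})',$

we can write $\hat\Sigma_1$ as

(27)  $\hat\Sigma_1 = \frac{1}{N_1 + N_2 + 1}\Big[A + \frac{N_1}{N_1 + 1}(x - \bar{x}^{(1)})(x - \bar{x}^{(1)})'\Big],$

where A is given by (24). Under the assumptions of the alternative hypothesis
we find (by considerations of symmetry) that the maximum likelihood
estimators of the parameters are

(28)  $\hat\mu_2^{(1)} = \bar{x}^{(1)}, \qquad \hat\mu_2^{(2)} = \frac{N_2\bar{x}^{(2)} + x}{N_2 + 1},$

      $\hat\Sigma_2 = \frac{1}{N_1 + N_2 + 1}\Big[A + \frac{N_2}{N_2 + 1}(x - \bar{x}^{(2)})(x - \bar{x}^{(2)})'\Big].$

The likelihood ratio criterion is, therefore, the $(N_1 + N_2 + 1)/2$th power of

(29)  $\frac{|\hat\Sigma_2|}{|\hat\Sigma_1|} = \frac{\big|A + \frac{N_2}{N_2 + 1}(x - \bar{x}^{(2)})(x - \bar{x}^{(2)})'\big|}{\big|A + \frac{N_1}{N_1 + 1}(x - \bar{x}^{(1)})(x - \bar{x}^{(1)})'\big|}.$

This ratio can also be written (Corollary A.3.1)

(30)  $\frac{1 + \frac{N_2}{N_2 + 1}(x - \bar{x}^{(2)})'A^{-1}(x - \bar{x}^{(2)})}{1 + \frac{N_1}{N_1 + 1}(x - \bar{x}^{(1)})'A^{-1}(x - \bar{x}^{(1)})} = \frac{n + \frac{N_2}{N_2 + 1}(x - \bar{x}^{(2)})'S^{-1}(x - \bar{x}^{(2)})}{n + \frac{N_1}{N_1 + 1}(x - \bar{x}^{(1)})'S^{-1}(x - \bar{x}^{(1)})},$

where $n = N_1 + N_2 - 2$. The region of classification into $\pi_1$ consists of those
points for which the ratio (30) is greater than or equal to a given number $K_n$.
It can be written

(31)  $R_1:\ n + \frac{N_2}{N_2 + 1}(x - \bar{x}^{(2)})'S^{-1}(x - \bar{x}^{(2)}) \ge K_n\Big[n + \frac{N_1}{N_1 + 1}(x - \bar{x}^{(1)})'S^{-1}(x - \bar{x}^{(1)})\Big].$

If $K_n = 1 + 2c/n$ and $N_1$ and $N_2$ are large, the region (31) is approximately
$W(x) \ge c$.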

If we take $K_n = 1$, the rule is to classify as $\pi_1$ if (30) is greater than 1 and
as $\pi_2$ if (30) is less than 1. This is the maximum likelihood rule. Let

(32)  $Z = \tfrac{1}{2}\Big[\frac{N_2}{N_2 + 1}(x - \bar{x}^{(2)})'S^{-1}(x - \bar{x}^{(2)}) - \frac{N_1}{N_1 + 1}(x - \bar{x}^{(1)})'S^{-1}(x - \bar{x}^{(1)})\Big].$

Then the maximum likelihood rule is to classify as $\pi_1$ if $Z > 0$ and $\pi_2$ if
$Z < 0$. Roughly speaking, assign x to $\pi_1$ or $\pi_2$ according to whether the
distance to $\bar{x}^{(1)}$ is less or greater than the distance to $\bar{x}^{(2)}$. The difference
between W and Z is

(33)  $W - Z = \tfrac{1}{2}\Big[\frac{1}{N_2 + 1}(x - \bar{x}^{(2)})'S^{-1}(x - \bar{x}^{(2)}) - \frac{1}{N_1 + 1}(x - \bar{x}^{(1)})'S^{-1}(x - \bar{x}^{(1)})\Big],$

which has the probability limit 0 as $N_1, N_2 \to \infty$. The probabilities of
misclassification with W are equivalent asymptotically to those with Z for large
samples.

Note that for $N_1 = N_2$, $Z = [N_1/(N_1 + 1)]W$. Then the symmetric test
based on the cutoff $c = 0$ is the same for Z and W.
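
The relation between the two criteria can be checked numerically. The following Python sketch is illustrative only: the function names and the numerical inputs are hypothetical, and S is assumed nonsingular. It writes W as half the difference of the two squared distances $(x - \bar{x}^{(i)})'S^{-1}(x - \bar{x}^{(i)})$, which is algebraically the same as the linear form (6), and Z as the same difference with the weights $N_i/(N_i + 1)$.

```python
def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting
    (a is assumed nonsingular)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(m[r][k]))
        m[k], m[piv] = m[piv], m[k]
        for r in range(k + 1, n):
            f = m[r][k] / m[k][k]
            for c in range(k, n + 1):
                m[r][c] -= f * m[k][c]
    z = [0.0] * n
    for k in range(n - 1, -1, -1):
        z[k] = (m[k][n] - sum(m[k][c] * z[c] for c in range(k + 1, n))) / m[k][k]
    return z


def quad(s, d):
    """Quadratic form d' S^{-1} d."""
    return sum(di * zi for di, zi in zip(d, solve(s, d)))


def w_and_z(x, xbar1, xbar2, s, n1, n2):
    """Return (W, Z) for the observation x."""
    q1 = quad(s, [xi - mi for xi, mi in zip(x, xbar1)])
    q2 = quad(s, [xi - mi for xi, mi in zip(x, xbar2)])
    w = 0.5 * (q2 - q1)
    z = 0.5 * (n2 / (n2 + 1) * q2 - n1 / (n1 + 1) * q1)
    return w, z


# Hypothetical inputs; with equal sample sizes Z = [N1/(N1+1)] W.
S = [[2.0, 0.5], [0.5, 1.0]]
w_eq, z_eq = w_and_z([1.0, 2.0], [3.0, 1.0], [0.0, 0.0], S, 10, 10)
w_ne, z_ne = w_and_z([1.0, 2.0], [3.0, 1.0], [0.0, 0.0], S, 5, 20)
```

With unequal sample sizes W is unchanged but Z is not a multiple of W, so the two rules can differ for observations near the boundary.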


6.5.6. Invariance 

The classification problem is invariant with respect to transformations

      $x_\alpha^{(1)*} = Bx_\alpha^{(1)} + c, \quad \alpha = 1, \ldots, N_1,$

(34)  $x_\alpha^{(2)*} = Bx_\alpha^{(2)} + c, \quad \alpha = 1, \ldots, N_2,$

      $x^* = Bx + c,$

where B is nonsingular and c is a vector. This transformation induces the
following transformation on the sufficient statistics:

(35)  $\bar{x}^{(1)*} = B\bar{x}^{(1)} + c, \qquad \bar{x}^{(2)*} = B\bar{x}^{(2)} + c,$
      $x^* = Bx + c, \qquad S^* = BSB',$

with the same transformations on the parameters, $\mu^{(1)}$, $\mu^{(2)}$, and $\Sigma$. (Note
that $\mathcal{E}x = \mu^{(1)}$ or $\mu^{(2)}$.) Any invariant of the parameters is a function of
$\Delta^2 = (\mu^{(1)} - \mu^{(2)})'\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})$. There exists a matrix B and a vector c
such that

(36)  $\mu^{(1)*} = B\mu^{(1)} + c = 0, \qquad \mu^{(2)*} = B\mu^{(2)} + c = (\Delta, 0, \ldots, 0)', \qquad \Sigma^* = B\Sigma B' = I.$

Therefore, $\Delta^2$ is the minimal invariant of the parameters. The elements of M
defined by (9) are invariant and are the minimal invariants of the sufficient
statistics. Thus invariant procedures depend on M, and the distribution of M
depends only on $\Delta^2$. The statistics W and Z are invariant.

6.6. PROBABILITIES OF MISCLASSIFICATION 

6.6.1. Asymptotic Expansions of the Probabilities of Misclassification
Using W

We may want to know the probabilities of misclassification before we draw 
the two samples for determining the classification rule, and we may want to 
know the (conditional) probabilities of misclassification after drawing the 
samples. As observed earlier, the exact distributions of W and Z are very 
difficult to calculate. Therefore, we treat asymptotic expansions of their 
probabilities as $N_1$ and $N_2$ increase. The background is that the limiting
distribution of W and Z is $N(\tfrac12\Delta^2, \Delta^2)$ if x is from $\pi_1$ and is $N(-\tfrac12\Delta^2, \Delta^2)$ if
x is from $\pi_2$.

Okamoto (1963) obtained the asymptotic expansion of the distribution of
W to terms of order $n^{-2}$, and Siotani and Wang (1975), (1977) to terms of
order $n^{-3}$. [Bowker and Sitgreaves (1961) treated the case of $N_1 = N_2$.] Let
$\Phi(\cdot)$ and $\phi(\cdot)$ be the cdf and density of $N(0, 1)$, respectively.

Theorem 6.6.1. As $N_1 \to \infty$, $N_2 \to \infty$, and $N_1/N_2 \to$ a positive limit
($n = N_1 + N_2 - 2$),

(1)  $\Pr\Big\{\frac{W - \tfrac12\Delta^2}{\Delta} \le u \,\Big|\, \pi_1\Big\}$
       $= \Phi(u) - \phi(u)\Big\{\frac{1}{2N_1\Delta^2}[u^3 + (p - 3)u - p\Delta]$
       $\quad + \frac{1}{2N_2\Delta^2}[u^3 + 2\Delta u^2 + (p - 3 + \Delta^2)u + (p - 2)\Delta]$
       $\quad + \frac{1}{4n}[4u^3 + 4\Delta u^2 + (6p - 6 + \Delta^2)u + 2(p - 1)\Delta]\Big\} + O(n^{-2}),$

and $\Pr\{-(W + \tfrac12\Delta^2)/\Delta \le u \mid \pi_2\}$ is (1) with $N_1$ and $N_2$ interchanged.
The rule using W is to assign the observation x to $\pi_1$ if $W(x) > c$ and to
$\pi_2$ if $W(x) \le c$. The probabilities of misclassification are given by Theorem
6.6.1 with $u = (c - \tfrac12\Delta^2)/\Delta$ and $u = -(c + \tfrac12\Delta^2)/\Delta$, respectively. For $c = 0$,
$u = -\tfrac12\Delta$. If $N_1 = N_2$, this defines an exact minimax procedure [Das Gupta
(1965)].

Corollary 6.6.1

(2)  $\Pr\Big\{W \le 0 \,\Big|\, \pi_1, \lim_{n \to \infty} \frac{N_1}{N_2} = 1\Big\}$
       $= \Phi(-\tfrac12\Delta) + \frac{1}{n}\phi(\tfrac12\Delta)\Big[\frac{p - 1}{\Delta} + \frac{p\Delta}{4}\Big] + o(n^{-1})$
       $= \Pr\Big\{W \ge 0 \,\Big|\, \pi_2, \lim_{n \to \infty} \frac{N_1}{N_2} = 1\Big\}.$

Note that the correction term is positive, as far as this correction goes;
that is, the probability of misclassification is greater than the value of the
normal approximation. The correction term (to order $n^{-1}$) increases with p
for given $\Delta$ and decreases with $\Delta$ for given p.
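
As a numerical illustration of the size of the correction, the following Python sketch evaluates the expansion of Theorem 6.6.1, $\Phi(u) - \phi(u)\{\cdot\}$, and the equal-sample-size approximation of Corollary 6.6.1, $\Phi(-\tfrac12\Delta) + n^{-1}\phi(\tfrac12\Delta)[(p-1)/\Delta + p\Delta/4]$, with the remainder terms dropped. The function names and the numerical values of $\Delta$, p, and the sample sizes are assumptions chosen for illustration.

```python
import math


def phi(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def Phi(u):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))


def okamoto_cdf(u, delta, p, n1, n2):
    """Expansion of Pr{(W - Delta^2/2)/Delta <= u | pi_1} without the remainder."""
    n = n1 + n2 - 2
    t1 = (u**3 + (p - 3) * u - p * delta) / (2 * n1 * delta**2)
    t2 = (u**3 + 2 * delta * u**2 + (p - 3 + delta**2) * u
          + (p - 2) * delta) / (2 * n2 * delta**2)
    t3 = (4 * u**3 + 4 * delta * u**2 + (6 * p - 6 + delta**2) * u
          + 2 * (p - 1) * delta) / (4 * n)
    return Phi(u) - phi(u) * (t1 + t2 + t3)


def corollary_error(delta, p, n):
    """Approximate Pr{W <= 0 | pi_1} for equal sample sizes, remainder dropped."""
    return Phi(-delta / 2) + phi(delta / 2) * ((p - 1) / delta + p * delta / 4) / n


delta, p, N = 2.0, 4, 500
from_theorem = okamoto_cdf(-delta / 2, delta, p, N, N)
from_corollary = corollary_error(delta, p, 2 * N - 2)
```

For equal sample sizes the two values agree up to terms of smaller order, and both exceed the limiting value $\Phi(-\tfrac12\Delta)$.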

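Since $\Delta$ is usually unknown, it is relevant to Studentize W. The sample
Mahalanobis squared distance

(3)  $D^2 = (\bar{x}^{(1)} - \bar{x}^{(2)})'S^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)})$

is an estimator of the population Mahalanobis squared distance $\Delta^2$. The
expectation of $D^2$ is

(4)  $\mathcal{E}D^2 = \frac{n}{n - p - 1}\Big[\Delta^2 + p\Big(\frac{1}{N_1} + \frac{1}{N_2}\Big)\Big].$

See Problem 6.14. If $N_1$ and $N_2$ are large, this is approximately $\Delta^2$.
Anderson (1973b) showed the following:

Theorem 6.6.2. If $N_1/N_2 \to$ a positive limit as $n \to \infty$,

(5)  $\Pr\Big\{\frac{W - \tfrac12 D^2}{D} \le u \,\Big|\, \pi_1\Big\} = \Phi(u) - \phi(u)\Big\{\frac{1}{N_1}\Big(\frac{u}{2} - \frac{p - 1}{\Delta}\Big) + \frac{1}{n}\Big[\frac{u^3}{4} + \Big(p - \frac{3}{4}\Big)u\Big]\Big\} + O(n^{-2}),$

(6)  $\Pr\Big\{-\frac{W + \tfrac12 D^2}{D} \le u \,\Big|\, \pi_2\Big\} = \Phi(u) - \phi(u)\Big\{\frac{1}{N_2}\Big(\frac{u}{2} - \frac{p - 1}{\Delta}\Big) + \frac{1}{n}\Big[\frac{u^3}{4} + \Big(p - \frac{3}{4}\Big)u\Big]\Big\} + O(n^{-2}).$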
Usually, one is interested in $u \le 0$ (small probabilities of error). Then the
correction term is positive; that is, the normal approximation underestimates
the probability of misclassification.

One may want to choose the cutoff point c so that one probability of
misclassification is controlled. Let $\alpha$ be the desired $\Pr\{W < c \mid \pi_1\}$. Anderson
(1973b, 1973c) derived the following theorem:

Theorem 6.6.3. Let $u_0$ be such that $\Phi(u_0) = \alpha$, and let

(7)  $u = u_0 - \frac{1}{N_1}\Big[\frac{p - 1}{D} - \tfrac{1}{2}u_0\Big] + \frac{1}{n}\Big[\Big(p - \tfrac{3}{4}\Big)u_0 + \tfrac{1}{4}u_0^3\Big].$

Then as $N_1 \to \infty$, $N_2 \to \infty$, and $N_1/N_2 \to$ a positive limit,

(8)  $\Pr\Big\{\frac{W - \tfrac12 D^2}{D} \le u \,\Big|\, \pi_1\Big\} = \alpha + O(n^{-2}).$

Then $c = Du + \tfrac12 D^2$ will attain the desired probability $\alpha$ to within $O(n^{-2})$.
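
The choice of cutoff can be sketched in Python as follows. This is an illustration under assumptions, not a definitive implementation: it takes the adjustment in (7) to be $u = u_0 - N_1^{-1}[(p-1)/D - \tfrac12 u_0] + n^{-1}[(p - \tfrac34)u_0 + \tfrac14 u_0^3]$, obtains $u_0$ from the standard-library normal quantile function, and returns $c = Du + \tfrac12 D^2$. The function name `cutoff` and the numerical inputs are hypothetical.

```python
from statistics import NormalDist


def cutoff(alpha, d, p, n1, n2):
    """Cutoff c for W so that Pr{W < c | pi_1} is approximately alpha.

    d is the sample Mahalanobis distance D (the square root of D^2).
    """
    n = n1 + n2 - 2
    u0 = NormalDist().inv_cdf(alpha)
    u = (u0
         - ((p - 1) / d - 0.5 * u0) / n1
         + ((p - 0.75) * u0 + 0.25 * u0**3) / n)
    return d * u + 0.5 * d * d


# Hypothetical values: D = 2, p = 4, two samples of 25, target alpha = 0.05.
c_adjusted = cutoff(0.05, 2.0, 4, 25, 25)
c_naive = 2.0 * NormalDist().inv_cdf(0.05) + 0.5 * 2.0**2
```

For $u_0 < 0$ both adjustments are negative, so the adjusted cutoff lies below the value $Du_0 + \tfrac12 D^2$ suggested by the limiting normal distribution.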
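We now turn to evaluating the probabilities of misclassification after the
two samples have been drawn. Conditional on $\bar{x}^{(1)}$, $\bar{x}^{(2)}$, and S, the random
variable W is normally distributed with conditional mean

(9)  $\mathcal{E}(W \mid \pi_i, \bar{x}^{(1)}, \bar{x}^{(2)}, S) = [\mu^{(i)} - \tfrac12(\bar{x}^{(1)} + \bar{x}^{(2)})]'S^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)}) = \mu^{(i)}(\bar{x}^{(1)}, \bar{x}^{(2)}, S)$

when x is from $\pi_i$, $i = 1, 2$, and conditional variance

(10)  $\operatorname{Var}(W \mid \bar{x}^{(1)}, \bar{x}^{(2)}, S) = (\bar{x}^{(1)} - \bar{x}^{(2)})'S^{-1}\Sigma S^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)}) = \sigma^2(\bar{x}^{(1)}, \bar{x}^{(2)}, S).$

Note that these means and variance are functions of the samples with
probability limits

(11)  $\operatorname{plim}_{N_1, N_2 \to \infty} \mu^{(i)}(\bar{x}^{(1)}, \bar{x}^{(2)}, S) = (-1)^{i-1}\tfrac12\Delta^2,$
      $\operatorname{plim}_{N_1, N_2 \to \infty} \sigma^2(\bar{x}^{(1)}, \bar{x}^{(2)}, S) = \Delta^2.$

For large $N_1$ and $N_2$ the conditional probabilities of misclassification are
close to the limiting normal probabilities (with high probability relative to
$\bar{x}^{(1)}$, $\bar{x}^{(2)}$, and S).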

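When c is the cutoff point, probabilities of misclassification conditional on
$\bar{x}^{(1)}$, $\bar{x}^{(2)}$, and S are

(12)  $P(2 \mid 1, c, \bar{x}^{(1)}, \bar{x}^{(2)}, S) = \Phi\Big[\frac{c - \mu^{(1)}(\bar{x}^{(1)}, \bar{x}^{(2)}, S)}{\sigma(\bar{x}^{(1)}, \bar{x}^{(2)}, S)}\Big],$

(13)  $P(1 \mid 2, c, \bar{x}^{(1)}, \bar{x}^{(2)}, S) = 1 - \Phi\Big[\frac{c - \mu^{(2)}(\bar{x}^{(1)}, \bar{x}^{(2)}, S)}{\sigma(\bar{x}^{(1)}, \bar{x}^{(2)}, S)}\Big].$

In (12) write c as $Du_1 + \tfrac12 D^2$. Then the argument of $\Phi(\cdot)$ in (12) is
$u_1D/\sigma + (\bar{x}^{(1)} - \bar{x}^{(2)})'S^{-1}(\bar{x}^{(1)} - \mu^{(1)})/\sigma$; the first term converges in
probability to $u_1$, the second term tends to 0 as $N_1 \to \infty$, $N_2 \to \infty$, and (12) to $\Phi(u_1)$.

In (13) write c as $Du_2 - \tfrac12 D^2$. Then the argument of $\Phi(\cdot)$ in (13) is
$u_2D/\sigma + (\bar{x}^{(1)} - \bar{x}^{(2)})'S^{-1}(\bar{x}^{(2)} - \mu^{(2)})/\sigma$. The first term converges in
probability to $u_2$ and the second term to 0; (13) converges to $1 - \Phi(u_2)$.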

For given $\bar{x}^{(1)}$, $\bar{x}^{(2)}$, and S the (conditional) probabilities of
misclassification (12) and (13) are functions of the parameters $\mu^{(1)}$, $\mu^{(2)}$, $\Sigma$ and can be
estimated. Consider them when $c = 0$. Then (12) and (13) converge in
probability to $\Phi(-\tfrac12\Delta)$; that suggests $\Phi(-\tfrac12 D)$ as an estimator of (12) and
(13). A better estimator is $\Phi(-\tfrac12\tilde{D})$, where $\tilde{D}^2 = (n - p - 1)D^2/n$, which
is closer to being an unbiased estimator of $\Delta^2$. [See (4).] McLachlan
(1973, 1974a, 1974b, 1974c) gave an estimator of (12) whose bias is of order
$n^{-2}$; it is

(14)  $\Phi(-\tfrac12 D) + \phi(\tfrac12 D)\Big\{\frac{p - 1}{N_1 D} + \frac{1}{32n}[-D^3 + 4(4p - 1)D]\Big\}.$

[McLachlan gave (14) to terms of order $n^{-1}$.] McLachlan explored the
properties of these and other estimators, as did Lachenbruch and Mickey
(1968).

Now consider (12) with $c = Du_1 + \tfrac12 D^2$; $u_1$ might be chosen to control
$P(2 \mid 1)$ conditional on $\bar{x}^{(1)}$, $\bar{x}^{(2)}$, S. This conditional probability as a function of
$\bar{x}^{(1)}$, $\bar{x}^{(2)}$, S is a random variable whose distribution may be approximated.
McLachlan showed the following:

Theorem 6.6.4. As $N_1 \to \infty$, $N_2 \to \infty$, and $N_1/N_2 \to$ a positive limit,


(15) Pr 


,X' 1 % X 1 -% 5) -!])(»,) 


1 ^P(2\\,Du l + ^D\x i]) ,x i2) 






(p- "W o — i +n/Wi)»i -t^i/4 


+ 0(n~ 2 ). 
McLachlan (1977) gave a method of selecting $u_1$ so that the probability of
one misclassification is less than a preassigned $\delta$ with a preassigned
confidence level $1 - \varepsilon$.


6.6.2. Asymptotic Expansions of the Probabilities of Misclassification
Using Z

We now turn our attention to Z defined by (32) of Section 6.5. The results
are parallel to those for W. Memon and Okamoto (1971) expanded the
distribution of Z to terms of order $n^{-2}$, and Siotani and Wang (1975), (1977)
to terms of order $n^{-3}$.


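Theorem 6.6.5. As $N_1 \to \infty$, $N_2 \to \infty$, and $N_1/N_2$ approaches a positive
limit,

(16)  $\Pr\Big\{\frac{Z - \tfrac12\Delta^2}{\Delta} \le u \,\Big|\, \pi_1\Big\}$
        $= \Phi(u) - \phi(u)\Big\{\frac{1}{2N_1\Delta^2}[u^3 + \Delta u^2 + (p - 3)u - \Delta]$
        $\quad + \frac{1}{2N_2\Delta^2}[u^3 + \Delta u^2 + (p - 3 - \Delta^2)u - \Delta^3 - \Delta]$
        $\quad + \frac{1}{4n}[4u^3 + 4\Delta u^2 + (6p - 6 + \Delta^2)u + 2(p - 1)\Delta]\Big\} + O(n^{-2}),$

and $\Pr\{-(Z + \tfrac12\Delta^2)/\Delta \le u \mid \pi_2\}$ is (16) with $N_1$ and $N_2$ interchanged.

When $c = 0$, then $u = -\tfrac12\Delta$. If $N_1 = N_2$, the rule with Z is identical to the
rule with W, and the probability of misclassification is given by (2).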

Fujikoshi and Kanazawa (1976) proved 

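Theorem 6.6.6

(17)  $\Pr\Big\{\frac{Z - \tfrac12 D^2}{D} \le u \,\Big|\, \pi_1\Big\}$
        $= \Phi(u) - \phi(u)\Big\{\frac{1}{2N_1\Delta}[u^2 + \Delta u - (p - 1)] - \frac{1}{2N_2\Delta}[u^2 + 2\Delta u + p - 1 + \Delta^2]$
        $\quad + \frac{1}{4n}[u^3 + (4p - 3)u]\Big\} + O(n^{-2}),$

(18)  $\Pr\Big\{-\frac{Z + \tfrac12 D^2}{D} \le u \,\Big|\, \pi_2\Big\}$
        $= \Phi(u) - \phi(u)\Big\{-\frac{1}{2N_1\Delta}[u^2 + 2\Delta u + p - 1 + \Delta^2] + \frac{1}{2N_2\Delta}[u^2 + \Delta u - (p - 1)]$
        $\quad + \frac{1}{4n}[u^3 + (4p - 3)u]\Big\} + O(n^{-2}).$

Kanazawa (1979) showed the following:

Theorem 6.6.7. Let $u_0$ be such that $\Phi(u_0) = \alpha$, and let

(19)  $u = u_0 + \frac{1}{2N_1D}[u_0^2 + Du_0 - (p - 1)] - \frac{1}{2N_2D}[u_0^2 + Du_0 + (p - 1) - D^2] + \frac{1}{4n}[u_0^3 + (4p - 5)u_0].$

Then as $N_1 \to \infty$, $N_2 \to \infty$, and $N_1/N_2 \to$ a positive limit,

(20)  $\Pr\Big\{\frac{Z - \tfrac12 D^2}{D} \le u \,\Big|\, \pi_1\Big\} = \alpha + O(n^{-2}).$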

Now consider the probabilities of misclassification after the samples have 
been drawn. The conditional distribution of Z is not normal; Z is quadratic 
in x unless N { = N 2 . We do not have expressions equivalent to (12) and (13). 
Siotani (1980) showed the following: 


Theorem 6.6.8. As $N_1 \to \infty$, $N_2 \to \infty$, and $N_1/N_2 \to$ a positive limit,

(2l) Pr ( 2 '/ Z ^ ， i ， °，无⑴，妒 )， j 

(} \1 ^ 1+^2 吨 A) 鬥 

x ~ 2 ]j {t6^a1 4 ^~ 

+ t^1 4 ^ _1 ) +3a 2 1 _ (/? 4« 1)a } +°( n " 2 )- 

It is also possible to obtain a similar expression for
$P(2 \mid 1, Du_1 + \tfrac12 D^2, \bar{x}^{(1)}, \bar{x}^{(2)}, S)$ for Z and a confidence interval. See Siotani (1980).
6.7. CLASSIFICATION INTO ONE OF SEVERAL POPULATIONS

Let us now consider the problem of classifying an observation into one of
several populations. We shall extend the consideration of the previous
sections to the cases of more than two populations. Let $\pi_1, \ldots, \pi_m$ be m
populations with density functions $p_1(x), \ldots, p_m(x)$, respectively. We wish to
divide the space of observations into m mutually exclusive and exhaustive
regions $R_1, \ldots, R_m$. If an observation falls into $R_i$, we shall say that it comes
from $\pi_i$. Let the cost of misclassifying an observation from $\pi_i$ as coming from
$\pi_j$ be $C(j|i)$. The probability of this misclassification is

(1)  $P(j|i, R) = \int_{R_j} p_i(x)\,dx.$

Suppose we have a priori probabilities of the populations, $q_1, \ldots, q_m$. Then
the expected loss is

(2)  $\sum_{i=1}^{m} q_i\Big[\sum_{j=1,\, j \ne i}^{m} C(j|i)P(j|i, R)\Big].$

We should like to choose $R_1, \ldots, R_m$ to make this a minimum.

Since we have a priori probabilities for the populations, we can define the
conditional probability of an observation coming from a population given
the values of the components of the vector x. The conditional probability of
the observation coming from $\pi_i$ is

(3)  $\frac{q_ip_i(x)}{\sum_{k=1}^{m} q_kp_k(x)}.$

If we classify the observation as from $\pi_j$, the expected loss is

(4)  $\sum_{i=1,\, i \ne j}^{m} \frac{q_ip_i(x)}{\sum_{k=1}^{m} q_kp_k(x)}\, C(j|i).$

We minimize the expected loss at this point if we choose j so as to minimize
(4); that is, we consider

(5)  $\sum_{i=1,\, i \ne j}^{m} q_ip_i(x)C(j|i)$

for all j and select that j that gives the minimum. (If two different indices
give the minimum, it is irrelevant which index is selected.) This procedure
assigns the point x to one of the $R_j$. Following this procedure for each x, we
define our regions $R_1, \ldots, R_m$. The classification procedure, then, is to
classify an observation as coming from $\pi_j$ if it falls in $R_j$.
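
The pointwise rule just described can be sketched in Python. The sketch is illustrative only: the function name `bayes_classify`, the zero-based indexing of populations, the cost array layout `cost[j][i]` for $C(j|i)$, and the three univariate normal densities in the usage example are all assumptions made for the illustration. For a given x it evaluates $\sum_{i \ne j} q_ip_i(x)C(j|i)$ for every j and returns the index with the smallest value.

```python
import math


def bayes_classify(q, densities, cost, x):
    """Return the index j minimizing sum over i != j of q[i] * p_i(x) * C(j|i).

    q         -- a priori probabilities q_i
    densities -- functions p_i(x)
    cost      -- cost[j][i] is C(j|i), the cost of assigning to j when i is true
    Populations are indexed from 0 here.
    """
    m = len(q)
    weighted = [q[i] * densities[i](x) for i in range(m)]
    losses = [
        sum(weighted[i] * cost[j][i] for i in range(m) if i != j)
        for j in range(m)
    ]
    return min(range(m), key=losses.__getitem__)


def normal_density(mu):
    """Density of N(mu, 1), used only for the illustration."""
    return lambda x: math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)


densities = [normal_density(0.0), normal_density(2.0), normal_density(4.0)]
equal_q = [1 / 3, 1 / 3, 1 / 3]
unit_cost = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
labels = [bayes_classify(equal_q, densities, unit_cost, x) for x in (0.9, 1.1, 3.5)]
```

With equal costs and equal a priori probabilities the rule reduces to choosing the population with the largest density at x, so in this example the boundaries fall at the midpoints between adjacent means.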

Theorem 6.7.1. If $q_i$ is the a priori probability of drawing an observation
from population $\pi_i$ with density $p_i(x)$, $i = 1, \ldots, m$, and if the cost of
misclassifying an observation from $\pi_i$ as from $\pi_j$ is $C(j|i)$, then the regions of
classification, $R_1, \ldots, R_m$, that minimize the expected cost are defined by assigning x to
$R_k$ if

(6)  $\sum_{i=1,\, i \ne k}^{m} q_ip_i(x)C(k|i) < \sum_{i=1,\, i \ne j}^{m} q_ip_i(x)C(j|i), \qquad j = 1, \ldots, m,\ j \ne k.$

[If (6) holds for all j ($j \ne k$) except for h indices and the inequality is replaced by
equality for those indices, then this point can be assigned to any of the $h + 1$ $\pi$'s.]
If the probability of equality between the right-hand and left-hand sides of (6) is
zero for each k and j under $\pi_i$ (each i), then the minimizing procedure is unique
except for sets of probability zero.

Proof. We now verify this result. Let

(7)  $h_j(x) = \sum_{i=1,\, i \ne j}^{m} q_ip_i(x)C(j|i).$

Then the expected loss of a procedure R is

(8)  $\sum_{j=1}^{m} \int_{R_j} h_j(x)\,dx = \int h(x|R)\,dx,$

where $h(x|R) = h_j(x)$ for x in $R_j$. For the Bayes procedure $R^*$ described in
the theorem, $h(x|R)$ is $h(x|R^*) = \min_i h_i(x)$. Thus the difference between
the expected loss for any procedure R and for $R^*$ is

(9)  $\int [h(x|R) - h(x|R^*)]\,dx = \sum_j \int_{R_j} [h_j(x) - \min_i h_i(x)]\,dx \ge 0.$

Equality can hold only if $h_j(x) = \min_i h_i(x)$ for x in $R_j$ except for sets of
probability zero. ∎
Let us see how this method applies when $C(j|i) = 1$ for all i and j, $i \ne j$.
Then in $R_k$

(10)  $\sum_{i=1,\, i \ne k}^{m} q_ip_i(x) < \sum_{i=1,\, i \ne j}^{m} q_ip_i(x), \qquad j \ne k.$

Subtracting $\sum_{i=1,\, i \ne k, j}^{m} q_ip_i(x)$ from both sides of (10), we obtain

(11)  $q_jp_j(x) < q_kp_k(x), \qquad j \ne k.$

In this case the point x is in $R_k$ if k is the index for which $q_ip_i(x)$ is a
maximum; that is, $\pi_k$ is the most probable population.

Now suppose that we do not have a priori probabilities. Then we cannot
define an unconditional expected loss for a classification procedure. However,
we can define an expected loss on the condition that the observation
comes from a given population. The conditional expected loss if the
observation is from $\pi_i$ is

(12)  $\sum_{j=1,\, j \ne i}^{m} C(j|i)P(j|i, R) = r(i, R).$

A procedure R is at least as good as $R^*$ if $r(i, R) \le r(i, R^*)$, $i = 1, \ldots, m$; R
is better if at least one inequality is strict. R is admissible if there is no
procedure $R^*$ that is better. A class of procedures is complete if for every
procedure R outside the class there is a procedure $R^*$ in the class that is
better.

Now let us show that a Bayes procedure is admissible. Let R be a Bayes
procedure; let $R^*$ be another procedure. Since R is Bayes,

(13)  $\sum_{i=1}^{m} q_ir(i, R) \le \sum_{i=1}^{m} q_ir(i, R^*).$

Suppose $q_1 > 0$, $q_2 > 0$, $r(2, R^*) < r(2, R)$, and $r(i, R^*) \le r(i, R)$, $i = 3, \ldots, m$.
Then

(14)  $q_1[r(1, R) - r(1, R^*)] \le \sum_{i=2}^{m} q_i[r(i, R^*) - r(i, R)] < 0,$

and $r(1, R) < r(1, R^*)$. Thus $R^*$ is not better than R.

Theorem 6.7.2. If $q_i > 0$, $i = 1, \ldots, m$, then a Bayes procedure is
admissible.
We shall now assume that $C(i|j) = 1$, $i \ne j$, and $\Pr\{p_i(x) = 0 \mid \pi_j\} = 0$. The
latter condition implies that all $p_j(x)$ are positive on the same set (except for
a set of measure 0). Suppose $q_i = 0$ for $i = 1, \ldots, t$, and $q_i > 0$ for $i = t + 1, \ldots, m$.
Then for the Bayes solution $R_i$, $i = 1, \ldots, t$, is empty (except for
a set of probability 0), as seen from (11) [that is, $p_m(x) = 0$ for x in $R_i$].
It follows that $r(i, R) = \sum_{j \ne i} P(j|i, R) = 1 - P(i|i, R) = 1$ for $i = 1, \ldots, t$.
Then $(R_{t+1}, \ldots, R_m)$ is a Bayes solution for the problem involving
$p_{t+1}(x), \ldots, p_m(x)$ and $q_{t+1}, \ldots, q_m$. It follows from Theorem 6.7.2 that no
procedure $R^*$ for which $P(i|i, R^*) = 0$, $i = 1, \ldots, t$, can be better than the
Bayes procedure. Now consider a procedure $R^*$ such that $R_1^*$ includes a set
of positive probability so that $P(1|1, R^*) > 0$. For $R^*$ to be better than R,

(15)  $P(i|i, R) = \int_{R_i} p_i(x)\,dx \le P(i|i, R^*) = \int_{R_i^*} p_i(x)\,dx, \qquad i = 2, \ldots, m.$

In such a case a procedure $R^{**}$ where $R_i^{**}$ is empty, $i = 1, \ldots, t$, $R_i^{**} = R_i^*$,
$i = t + 1, \ldots, m - 1$, and $R_m^{**} = R_m^* \cup R_1^* \cup \cdots \cup R_t^*$ would give risks such
that

      $P(i|i, R^{**}) = 0, \qquad i = 1, \ldots, t,$

(16)  $P(i|i, R^{**}) = P(i|i, R^*) \ge P(i|i, R), \qquad i = t + 1, \ldots, m - 1,$

      $P(m|m, R^{**}) > P(m|m, R^*) \ge P(m|m, R).$

Then $(R_{t+1}^{**}, \ldots, R_m^{**})$ would be better than $(R_{t+1}, \ldots, R_m)$ for the $(m - t)$-decision
problem, which contradicts the preceding discussion.

Theorem 6.7.3. If $C(i|j) = 1$, $i \ne j$, and $\Pr\{p_i(x) = 0 \mid \pi_j\} = 0$, then a Bayes
procedure is admissible.

The converse is true without conditions (except that the parameter space 
is finite). 

Theorem 6.7.4. Every admissible procedure is a Bayes procedure. 

We shall not prove this theorem. It is Theorem 1 of Section 2.10 of 
Ferguson (1967), for example. The class of Bayes procedures is minimal
complete if each Bayes procedure is unique (for the specified probabilities). 

The minimax procedure is the Bayes procedure for which the risks are 
equal. 




There are available general treatments of statistical decision procedures 
by Wald (1950), Blackwell and Girshick (1954), Ferguson (1967), De Groot
(1970), Berger (1980b), and others.


6.8. CLASSIFICATION INTO ONE OF SEVERAL MULTIVARIATE 
NORMAL POPULATIONS 


We shall now apply the theory of Section 6.7 to the case in which each 
population has a normal distribution. [See von Mises (1945).] We assume that 
the means are different and the covariance matrices are alike. Let $N(\mu^{(i)}, \Sigma)$
be the distribution of $\pi_i$. The density is given by (1) of Section 6.4. At the
outset the parameters are assumed known. For general costs with known
a priori probabilities we can form the m functions (5) of Section 6.7 and
define the region $R_j$ as consisting of points x such that the jth function is
minimum.

In the remainder of our discussion we shall assume that the costs of
misclassification are equal. Then we use the functions

(1)  $u_{jk}(x) = \log\frac{p_j(x)}{p_k(x)} = [x - \tfrac12(\mu^{(j)} + \mu^{(k)})]'\Sigma^{-1}(\mu^{(j)} - \mu^{(k)}).$

If a priori probabilities are known, the region $R_j$ is defined by those x
satisfying

(2)  $R_j:\ u_{jk}(x) \ge \log\frac{q_k}{q_j}, \qquad k = 1, \ldots, m,\ k \ne j.$

Theorem 6.8.1. If $q_i$ is the a priori probability of drawing an observation
from $\pi_i = N(\mu^{(i)}, \Sigma)$, $i = 1, \ldots, m$, and if the costs of misclassification are equal,
then the regions of classification, $R_1, \ldots, R_m$, that minimize the expected cost are
defined by (2), where $u_{jk}(x)$ is given by (1).

It should be noted that each $u_{jk}(x)$ is the classification function related to
the jth and kth populations, and $u_{jk}(x) = -u_{kj}(x)$. Since these are linear
functions, the region $R_j$ is bounded by hyperplanes. If the means span an
$(m - 1)$-dimensional hyperplane (for example, if the vectors $\mu^{(i)}$ are linearly
independent and $p \ge m - 1$), then $R_j$ is bounded by $m - 1$ hyperplanes.

In the case of no set of a priori probabilities known, the region $R_j$ is
defined by inequalities

(3)  $u_{jk}(x) \ge c_j - c_k, \qquad k = 1, \ldots, m,\ k \ne j.$
The constants $c_k$ can be taken nonnegative. These sets of regions form the
class of admissible procedures. For the minimax procedure these constants
are determined so all $P(i|i, R)$ are equal.

We now show how to evaluate the probabilities of correct classification. If
X is a random observation, we consider the random variables

(4)  $U_{ji} = [X - \tfrac12(\mu^{(j)} + \mu^{(i)})]'\Sigma^{-1}(\mu^{(j)} - \mu^{(i)}).$

Here $U_{ji} = -U_{ij}$. Thus we use $m(m - 1)/2$ classification functions if the
means span an $(m - 1)$-dimensional hyperplane. If X is from $\pi_j$, then $U_{ji}$ is
distributed according to $N(\tfrac12\Delta_{ji}^2, \Delta_{ji}^2)$, where

(5)  $\Delta_{ji}^2 = (\mu^{(j)} - \mu^{(i)})'\Sigma^{-1}(\mu^{(j)} - \mu^{(i)}).$

The covariance of $U_{ji}$ and $U_{jk}$ is

(6)  $\operatorname{Cov}(U_{ji}, U_{jk}) = (\mu^{(j)} - \mu^{(i)})'\Sigma^{-1}(\mu^{(j)} - \mu^{(k)}).$

To determine the constants $c_j$ we consider the integrals

(7)  $P(j|j, R) = \int_{c_j - c_m}^{\infty} \cdots \int_{c_j - c_1}^{\infty} f_j\, du_{j1} \cdots du_{j,j-1}\, du_{j,j+1} \cdots du_{jm},$

where $f_j$ is the density of $U_{ji}$, $i = 1, 2, \ldots, m$, $i \ne j$.

Theorem 6.8.2. If $\pi_i$ is $N(\mu^{(i)}, \Sigma)$ and the costs of misclassification are
equal, then the regions of classification, $R_1, \ldots, R_m$, that minimize the maximum
conditional expected loss are defined by (3), where $u_{jk}(x)$ is given by (1). The
constants $c_j$ are determined so that the integrals (7) are equal.

As an example consider the case of $m = 3$. There is no loss of generality in
taking $p = 2$, for the density for higher p can be projected on the
two-dimensional plane determined by the means of the three populations if they are not
collinear (i.e., we can transform the vector x into $u_{12}$, $u_{13}$, and $p - 2$ other
coordinates, where these last $p - 2$ components are distributed independently
of $u_{12}$ and $u_{13}$ and with zero means). The regions $R_i$ are determined
by three half lines as shown in Figure 6.2. If this procedure is minimax, we
cannot move the line between $R_1$ and $R_2$ nearer $(\mu_1^{(1)}, \mu_2^{(1)})$, the line between
$R_2$ and $R_3$ nearer $(\mu_1^{(2)}, \mu_2^{(2)})$, and the line between $R_3$ and $R_1$ nearer
$(\mu_1^{(3)}, \mu_2^{(3)})$ and still retain the equality $P(1|1, R) = P(2|2, R) = P(3|3, R)$
without leaving a triangle that is not included in any region. Thus, since the
regions must exhaust the space, the lines must meet in a point, and the
equality of probabilities determines $c_i - c_j$ uniquely.

Figure 6.2. Classification regions.


To do this in a specific case in which we have numerical values for the
components of the vectors $\mu^{(1)}$, $\mu^{(2)}$, $\mu^{(3)}$, and the matrix $\Sigma$, we would
consider the three ($\le p + 1$) joint distributions, each of two $U_{ij}$'s ($j \ne i$). We
could try the values of $c_i = 0$ and, using tables [Pearson (1931)] of the
bivariate normal distribution, compute $P(i|i, R)$. By a trial-and-error method
we could obtain $c_i$ to approximate the above condition.

The preceding theory has been given on the assumption that the parameters
are known. If they are not known and if a sample from each population
is available, the estimators of the parameters can be substituted in the
definition of $u_{ij}(x)$. Let the observations be $x_1^{(i)}, \ldots, x_{N_i}^{(i)}$ from $N(\mu^{(i)}, \Sigma)$,
$i = 1, \ldots, m$. We estimate $\mu^{(i)}$ by

(8)  $\bar{x}^{(i)} = \frac{1}{N_i}\sum_{\alpha=1}^{N_i} x_\alpha^{(i)}$

and $\Sigma$ by S defined by

(9)  $\Big(\sum_{i=1}^{m} N_i - m\Big)S = \sum_{i=1}^{m}\sum_{\alpha=1}^{N_i} (x_\alpha^{(i)} - \bar{x}^{(i)})(x_\alpha^{(i)} - \bar{x}^{(i)})'.$

Then, the analog of $u_{ij}(x)$ is

(10)  $w_{ij}(x) = [x - \tfrac12(\bar{x}^{(i)} + \bar{x}^{(j)})]'S^{-1}(\bar{x}^{(i)} - \bar{x}^{(j)}).$

If the variables above are random, the distributions are different from those
of $U_{ij}$. However, as $N_i \to \infty$, the joint distributions approach those of $U_{ij}$.
Hence, for sufficiently large samples one can use the theory given above.
Table 6.2

                                        Mean
                        -------------------------------------------
Measurement             Brahmin         Artisan         Korwa
                        ($\pi_1$)       ($\pi_2$)       ($\pi_3$)
Stature ($x_1$)          164.51          160.53          158.17
Sitting height ($x_2$)    86.43           81.47           81.16
Nasal depth ($x_3$)       25.49           23.84           21.44
Nasal height ($x_4$)      51.24           48.62           46.72


6.9. AN EXAMPLE OF CLASSIFICATION INTO ONE OF SEVERAL
MULTIVARIATE NORMAL POPULATIONS

Rao (1948a) considers three populations consisting of the Brahmin caste
($\pi_1$), the Artisan caste ($\pi_2$), and the Korwa caste ($\pi_3$) of India. The
measurements for each individual of a caste are stature ($x_1$), sitting height
($x_2$), nasal depth ($x_3$), and nasal height ($x_4$). The means of these variables in
the three populations are given in Table 6.2. The matrix of correlations for
all the populations is

(1)  $\begin{pmatrix} 1.0000 & 0.5849 & 0.1774 & 0.1974 \\ 0.5849 & 1.0000 & 0.2094 & 0.2170 \\ 0.1774 & 0.2094 & 1.0000 & 0.2910 \\ 0.1974 & 0.2170 & 0.2910 & 1.0000 \end{pmatrix}.$

The standard deviations are $\sigma_1 = 5.74$, $\sigma_2 = 3.20$, $\sigma_3 = 1.75$, $\sigma_4 = 3.50$. We
assume that each population is normal. Our problem is to divide the space of
the four variables $x_1, x_2, x_3, x_4$ into three regions of classification. We
assume that the costs of misclassification are equal. We shall find (i) a set of
regions under the assumption that drawing a new observation from each
population is equally likely ($q_1 = q_2 = q_3 = \tfrac13$), and (ii) a set of regions such
that the largest probability of misclassification is minimized (the minimax
solution).

We first compute the coefficients of $\Sigma^{-1}(\mu^{(1)} - \mu^{(2)})$ and $\Sigma^{-1}(\mu^{(1)} - \mu^{(3)})$.
Then $\Sigma^{-1}(\mu^{(2)} - \mu^{(3)}) = \Sigma^{-1}(\mu^{(1)} - \mu^{(3)}) - \Sigma^{-1}(\mu^{(1)} - \mu^{(2)})$. Then we
calculate $\tfrac12(\mu^{(i)} + \mu^{(j)})'\Sigma^{-1}(\mu^{(i)} - \mu^{(j)})$. We obtain the discriminant functions†

      $u_{12}(x) = -0.0708x_1 + 0.4990x_2 + 0.3373x_3 + 0.0887x_4 - 43.13,$

(2)   $u_{13}(x) = 0.0003x_1 + 0.3550x_2 + 1.1063x_3 + 0.1375x_4 - 62.49,$

      $u_{23}(x) = 0.0711x_1 - 0.1440x_2 + 0.7690x_3 + 0.0488x_4 - 19.36.$

† Due to an error in computations, Rao's discriminant functions are incorrect. I am indebted to
Mr. Peter Frank for assistance in the computations.
Table 6.3

Population                              Standard
of x            U           Means       Deviation       Correlation
$\pi_1$         $U_{12}$    1.491       1.727            0.8658
                $U_{13}$    3.487       2.641
$\pi_2$         $U_{21}$    1.491       1.727           -0.3894
                $U_{23}$    1.031       1.436
$\pi_3$         $U_{31}$    3.487       2.641            0.7983
                $U_{32}$    1.031       1.436

The other three functions are u 2 i {x)= —u l 2 (x\ u 3l (x) = —w 13 (jc), and 
u 32 (^) == — u 23 (xX If there are a priori probabilities and they are equal, the 
best set of regions of classification are R } : u n {x) > 0, > 0; R z : 

u 2 i (x) ^ 0, u^ix) > 0; and R 3 : u 31 (jc) > 0, w 32 (^) > 0* For example, if we 
obtain an individual with measurements x such that u l 2 (x) > 0 and u^{x) ^ 
0, we classify him as a Brahmin. 

To find the probabilities of misclassification when an individual is drawn
from population $\pi_g$ we need the means, variances, and covariances of the
proper pairs of $u$'s. They are given in Table 6.3.†

The probabilities of misclassification are then obtained by use of the
tables for the bivariate normal distribution. These probabilities are 0.21 for
$\pi_1$, 0.42 for $\pi_2$, and 0.25 for $\pi_3$. For example, if measurements are made on
a Brahmin, the probability that he is classified as an Artisan or Korwa is 0.21.

The minimax solution is obtained by finding the constants $c_1$, $c_2$, and $c_3$
for (3) of Section 6.8 so that the probabilities of misclassification are equal.
The regions of classification are

     $R'_1$: $u_{12}(x) \ge 0.54$,   $u_{13}(x) \ge 0.29$;
     $R'_2$: $u_{21}(x) \ge -0.54$,  $u_{23}(x) \ge -0.25$;
     $R'_3$: $u_{31}(x) \ge -0.29$,  $u_{32}(x) \ge 0.25$.

The common probability of misclassification (to two decimal places) is 0.30.
Thus the maximum probability of misclassification has been reduced from
0.42 to 0.30.


† Some numerical errors in Anderson (1951a) are corrected in Table 6.3 and (3).

6.10. CLASSIFICATION INTO ONE OF TWO KNOWN MULTIVARIATE
NORMAL POPULATIONS WITH UNEQUAL COVARIANCE MATRICES

6.10.1. Likelihood Procedures

Let $\pi_1$ and $\pi_2$ be $N(\mu^{(1)}, \Sigma_1)$ and $N(\mu^{(2)}, \Sigma_2)$ with $\mu^{(1)} \ne \mu^{(2)}$ and $\Sigma_1 \ne \Sigma_2$.

When the parameters are known, the likelihood ratio is

(1)  $$\frac{p_1(x)}{p_2(x)} = \frac{|\Sigma_2|^{1/2} \exp[-\frac{1}{2}(x - \mu^{(1)})'\Sigma_1^{-1}(x - \mu^{(1)})]}{|\Sigma_1|^{1/2} \exp[-\frac{1}{2}(x - \mu^{(2)})'\Sigma_2^{-1}(x - \mu^{(2)})]}$$
     $$= |\Sigma_2|^{1/2}|\Sigma_1|^{-1/2} \exp[\tfrac{1}{2}(x - \mu^{(2)})'\Sigma_2^{-1}(x - \mu^{(2)}) - \tfrac{1}{2}(x - \mu^{(1)})'\Sigma_1^{-1}(x - \mu^{(1)})].$$

The logarithm of (1) is quadratic in $x$. The probabilities of misclassification
are difficult to compute. [One can make a linear transformation of $x$ so that
its covariance matrix is $I$ and the matrix of the quadratic form is diagonal;
then the logarithm of (1) has the distribution of a linear combination of
noncentral $\chi^2$-variables plus a constant.]
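As an illustration of the known-parameter case, the logarithm of the likelihood ratio (1) can be evaluated directly. This is a minimal sketch; the function name `log_likelihood_ratio` and its argument order are assumptions made for the example.

```python
import numpy as np

def log_likelihood_ratio(x, mu1, sigma1, mu2, sigma2):
    """Logarithm of p_1(x)/p_2(x) in (1) for two normal populations."""
    x, mu1, mu2 = (np.asarray(v, dtype=float) for v in (x, mu1, mu2))
    d1, d2 = x - mu1, x - mu2
    q1 = d1 @ np.linalg.solve(sigma1, d1)   # (x - mu1)' Sigma_1^{-1} (x - mu1)
    q2 = d2 @ np.linalg.solve(sigma2, d2)   # (x - mu2)' Sigma_2^{-1} (x - mu2)
    _, logdet1 = np.linalg.slogdet(sigma1)
    _, logdet2 = np.linalg.slogdet(sigma2)
    return 0.5 * (logdet2 - logdet1) + 0.5 * (q2 - q1)
```

In a univariate case with equal means and variances 1 and 4, the value is positive near the common mean and negative far from it, so the region of classification into the first population is a finite interval.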

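When the parameters are unknown, we consider the problem as testing
the hypothesis that $x, x_1^{(1)}, \ldots, x_{N_1}^{(1)}$ are observations from $N(\mu^{(1)}, \Sigma_1)$ and
$x_1^{(2)}, \ldots, x_{N_2}^{(2)}$ are observations from $N(\mu^{(2)}, \Sigma_2)$ against the alternative that
$x_1^{(1)}, \ldots, x_{N_1}^{(1)}$ are observations from $N(\mu^{(1)}, \Sigma_1)$ and $x, x_1^{(2)}, \ldots, x_{N_2}^{(2)}$ are
observations from $N(\mu^{(2)}, \Sigma_2)$. Under the first hypothesis the maximum likelihood
estimators are $\hat\mu_1^{(1)} = (N_1\bar{x}^{(1)} + x)/(N_1 + 1)$, $\hat\mu_1^{(2)} = \bar{x}^{(2)}$,

(2)  $$\hat\Sigma_1(1) = \frac{1}{N_1 + 1}\Bigl[A_1 + \frac{N_1}{N_1 + 1}(x - \bar{x}^{(1)})(x - \bar{x}^{(1)})'\Bigr], \qquad \hat\Sigma_2(1) = \frac{1}{N_2}A_2,$$

where $A_i = \sum_{\alpha=1}^{N_i}(x_\alpha^{(i)} - \bar{x}^{(i)})(x_\alpha^{(i)} - \bar{x}^{(i)})'$, $i = 1, 2$. (See Section 6.5.5.) Under
the second hypothesis the maximum likelihood estimators are $\hat\mu_2^{(1)} = \bar{x}^{(1)}$,
$\hat\mu_2^{(2)} = (N_2\bar{x}^{(2)} + x)/(N_2 + 1)$,

(3)  $$\hat\Sigma_1(2) = \frac{1}{N_1}A_1, \qquad \hat\Sigma_2(2) = \frac{1}{N_2 + 1}\Bigl[A_2 + \frac{N_2}{N_2 + 1}(x - \bar{x}^{(2)})(x - \bar{x}^{(2)})'\Bigr].$$

The likelihood ratio criterion is

(4)  $$\frac{|\hat\Sigma_1(2)|^{\frac{1}{2}N_1}\,|\hat\Sigma_2(2)|^{\frac{1}{2}(N_2+1)}}{|\hat\Sigma_1(1)|^{\frac{1}{2}(N_1+1)}\,|\hat\Sigma_2(1)|^{\frac{1}{2}N_2}} = \frac{\bigl[1 + \frac{N_2}{N_2+1}(x - \bar{x}^{(2)})'A_2^{-1}(x - \bar{x}^{(2)})\bigr]^{\frac{1}{2}(N_2+1)}}{\bigl[1 + \frac{N_1}{N_1+1}(x - \bar{x}^{(1)})'A_1^{-1}(x - \bar{x}^{(1)})\bigr]^{\frac{1}{2}(N_1+1)}} \cdot \frac{(N_1+1)^{\frac{1}{2}(N_1+1)p}\,N_2^{\frac{1}{2}N_2 p}\,|A_2|^{\frac{1}{2}}}{N_1^{\frac{1}{2}N_1 p}\,(N_2+1)^{\frac{1}{2}(N_2+1)p}\,|A_1|^{\frac{1}{2}}}.$$

The observation $x$ is classified into $\pi_1$ if (4) is greater than 1 and into $\pi_2$ if
(4) is less than 1.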

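An alternative criterion is to plug estimates into the logarithm of (1). Use

(5)  $$\tfrac{1}{2}\bigl[(x - \bar{x}^{(2)})'S_2^{-1}(x - \bar{x}^{(2)}) - (x - \bar{x}^{(1)})'S_1^{-1}(x - \bar{x}^{(1)})\bigr]$$

to classify into $\pi_1$ if (5) is large and into $\pi_2$ if (5) is small. Again it is difficult
to evaluate the probabilities of misclassification.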

6.10.2. Linear Procedures 

The best procedures when $\Sigma_1 \ne \Sigma_2$ are not linear; when the parameters are
known, the best procedures are based on a quadratic function of the vector
observation $x$. The procedure depends very much on the assumed normality.
For example, in the case of $p = 1$, the region for classification with one
population is an interval and for the other is the complement of the interval,
that is, where the observation is sufficiently large or sufficiently small. In
the bivariate case the regions are defined by conic sections; for example, the
region of classification into one population might be the interior of an ellipse
or the region between two hyperbolas. In general, the regions are defined by
means of a quadratic function of the observations which is not necessarily a
positive definite quadratic form. These procedures depend very much on the
assumption of normality and especially on the shape of the normal distribution
far from its center. For instance, in the univariate case cited above the
region of classification into the first population is a finite interval because the
density of the first population falls off in either direction more rapidly than
the density of the second because its standard deviation is smaller.

One may want to use a classification procedure in a situation where the
two populations are centered around different points and have different
patterns of scatter, and where one considers multivariate normal distributions
to be reasonably good approximations for these two populations near
their centers and between their two centers (though not far from the centers,
where the densities are small). In such a case one may want to divide the
sample space into the two regions of classification by some simple curve or
surface. The simplest is a line or hyperplane; the procedure may then be
termed linear.

Let $b$ ($\ne 0$) be a vector (of $p$ components) and $c$ a scalar. An observation
$x$ is classified as from the first population if $b'x \ge c$ and as from the second if
$b'x < c$. We are primarily interested in situations where the important
difference between the two populations is the difference between the centers;
we assume $\mu^{(1)} \ne \mu^{(2)}$ as well as $\Sigma_1 \ne \Sigma_2$, and that $\Sigma_1$ and $\Sigma_2$ are
nonsingular.

When sampling from the $i$th population, $b'x$ has a univariate normal
distribution with mean $\mathcal{E}(b'x \mid \pi_i) = b'\mu^{(i)}$ and variance

(6)  $$\mathcal{E}(b'x - b'\mu^{(i)})^2 = b'\mathcal{E}(x - \mu^{(i)})(x - \mu^{(i)})'b = b'\Sigma_i b.$$


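The probability of misclassifying an observation when it comes from the first
population is

(7)  $$P(2|1) = \Pr\{b'x < c \mid \pi_1\} = \Pr\Bigl\{\frac{b'x - b'\mu^{(1)}}{(b'\Sigma_1 b)^{1/2}} < \frac{c - b'\mu^{(1)}}{(b'\Sigma_1 b)^{1/2}} \Bigm| \pi_1\Bigr\} = \Phi\Bigl(\frac{c - b'\mu^{(1)}}{(b'\Sigma_1 b)^{1/2}}\Bigr) = 1 - \Phi\Bigl(\frac{b'\mu^{(1)} - c}{(b'\Sigma_1 b)^{1/2}}\Bigr).$$

The probability of misclassifying an observation when it comes from the
second population is

(8)  $$P(1|2) = \Pr\{b'x \ge c \mid \pi_2\} = \Pr\Bigl\{\frac{b'x - b'\mu^{(2)}}{(b'\Sigma_2 b)^{1/2}} \ge \frac{c - b'\mu^{(2)}}{(b'\Sigma_2 b)^{1/2}} \Bigm| \pi_2\Bigr\} = 1 - \Phi\Bigl(\frac{c - b'\mu^{(2)}}{(b'\Sigma_2 b)^{1/2}}\Bigr).$$

It is desired to make these probabilities small or, equivalently, to make the
arguments

(9)  $$y_1 = \frac{b'\mu^{(1)} - c}{(b'\Sigma_1 b)^{1/2}}, \qquad y_2 = \frac{c - b'\mu^{(2)}}{(b'\Sigma_2 b)^{1/2}}$$

large. We shall consider making $y_1$ large for given $y_2$.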
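When we eliminate $c$ from (9), we obtain

(10)  $$y_1 = \frac{b'\gamma - y_2(b'\Sigma_2 b)^{1/2}}{(b'\Sigma_1 b)^{1/2}},$$

where $\gamma = \mu^{(1)} - \mu^{(2)}$. To maximize $y_1$ for given $y_2$ we differentiate $y_1$ with
respect to $b$ to obtain

(11)  $$\bigl[\gamma - y_2(b'\Sigma_2 b)^{-1/2}\Sigma_2 b\bigr](b'\Sigma_1 b)^{-1/2} - \bigl[b'\gamma - y_2(b'\Sigma_2 b)^{1/2}\bigr](b'\Sigma_1 b)^{-3/2}\Sigma_1 b.$$

If we let

(12)  $$t_1 = \frac{b'\gamma - y_2(b'\Sigma_2 b)^{1/2}}{b'\Sigma_1 b},$$

(13)  $$t_2 = \frac{y_2}{(b'\Sigma_2 b)^{1/2}},$$

then (11) set equal to 0 is

(14)  $$(t_1\Sigma_1 + t_2\Sigma_2)b = \gamma.$$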

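Note that (13) and (14) imply (12). If there is a pair $t_1$, $t_2$, and a vector $b$
satisfying (12) and (13), then $c$ is obtained from (9) as

(15)  $$c = y_2(b'\Sigma_2 b)^{1/2} + b'\mu^{(2)} = t_2 b'\Sigma_2 b + b'\mu^{(2)}.$$

Then from (9), (12), and (13)

(16)  $$y_1 = \frac{b'\mu^{(1)} - (t_2 b'\Sigma_2 b + b'\mu^{(2)})}{(b'\Sigma_1 b)^{1/2}} = t_1(b'\Sigma_1 b)^{1/2}.$$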






Now consider (14) as a function of $t$ ($0 \le t \le 1$). Let $t_1 = t$ and $t_2 = 1 - t$;
then $b = (t_1\Sigma_1 + t_2\Sigma_2)^{-1}\gamma$. Define $v_1 = t_1(b'\Sigma_1 b)^{1/2}$ and $v_2 = t_2(b'\Sigma_2 b)^{1/2}$. Writing
$M = t\Sigma_1 + (1 - t)\Sigma_2$, the derivative of $v_1^2$ with respect to $t$ is

(17)  $$\frac{d}{dt}\, t^2\gamma'M^{-1}\Sigma_1 M^{-1}\gamma = 2t\gamma'M^{-1}\Sigma_1 M^{-1}\gamma - t^2\gamma'M^{-1}(\Sigma_1 - \Sigma_2)M^{-1}\Sigma_1 M^{-1}\gamma - t^2\gamma'M^{-1}\Sigma_1 M^{-1}(\Sigma_1 - \Sigma_2)M^{-1}\gamma$$
      $$= t\gamma'M^{-1}\bigl\{\Sigma_2 M^{-1}\Sigma_1 + \Sigma_1 M^{-1}\Sigma_2\bigr\}M^{-1}\gamma,$$

which is positive for $0 < t < 1$ by the following lemma.

Lemma 6.10.1. If $\Sigma_1$ and $\Sigma_2$ are positive definite and $t_1 > 0$, $t_2 > 0$, then

(18)  $$\Sigma_1[t_1\Sigma_1 + t_2\Sigma_2]^{-1}\Sigma_2$$

is positive definite.

Proof. The matrix (18) is

(19)  $$\Sigma_1[\Sigma_2(t_1\Sigma_2^{-1} + t_2\Sigma_1^{-1})\Sigma_1]^{-1}\Sigma_2 = (t_1\Sigma_2^{-1} + t_2\Sigma_1^{-1})^{-1}.$$  ∎

Similarly $dv_2^2/dt < 0$. Since $v_1 \ge 0$, $v_2 \ge 0$, we see that $v_1$ increases with $t$
from 0 at $t = 0$ to $(\gamma'\Sigma_1^{-1}\gamma)^{1/2}$ at $t = 1$ and $v_2$ decreases from $(\gamma'\Sigma_2^{-1}\gamma)^{1/2}$ at
$t = 0$ to 0 at $t = 1$. The coordinates $v_1$ and $v_2$ are continuous functions of $t$.
For given $y_2$, $0 \le y_2 \le (\gamma'\Sigma_2^{-1}\gamma)^{1/2}$, there is a $t$ such that $y_2 = v_2 = t_2(b'\Sigma_2 b)^{1/2}$
and $b$ satisfies (14) for $t_1 = t$ and $t_2 = 1 - t$. Then $y_1 = v_1 = t_1(b'\Sigma_1 b)^{1/2}$ maximizes
$y_1$ for that value of $y_2$. Similarly given $y_1$, $0 \le y_1 \le (\gamma'\Sigma_1^{-1}\gamma)^{1/2}$, there is
a $t$ such that $y_1 = v_1 = t_1(b'\Sigma_1 b)^{1/2}$ and $b$ satisfies (14) for $t_1 = t$ and $t_2 = 1 - t$,
and $y_2 = v_2 = t_2(b'\Sigma_2 b)^{1/2}$ maximizes $y_2$. Note that $y_1 \ge 0$, $y_2 \ge 0$ implies the
errors of misclassification are not greater than $\frac{1}{2}$.

We now argue that the set of $y_1, y_2$ defined this way correspond to
admissible linear procedures. Let $x_1, x_2$ be in this set, and suppose another
procedure defined by $z_1, z_2$ were better than $x_1, x_2$, that is, $x_1 \le z_1$, $x_2 \le z_2$
with at least one strict inequality. For $y_1 = z_1$ let $y_2^*$ be the maximum $y_2$
among linear procedures; then $z_1 = y_1$, $z_2 \le y_2^*$ and hence $x_1 \le y_1$, $x_2 \le y_2^*$.
However, this is possible only if $x_1 = y_1$, $x_2 = y_2^*$, because $dy_1/dy_2 < 0$. Now
we have a contradiction to the assumption that $z_1, z_2$ was better than $x_1, x_2$.
Thus $x_1, x_2$ corresponds to an admissible linear procedure.

Use of Admissible Linear Procedures 

Given $t_1$ and $t_2$ such that $t_1\Sigma_1 + t_2\Sigma_2$ is positive definite, one would
compute the optimum $b$ by solving the linear equations (14) and then
compute $c$ by one of (9). Usually $t_1$ and $t_2$ are not given, but a desired
solution is specified in another way. We consider three ways.
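The computation for given $t_1$ and $t_2$ can be sketched as follows. This is a minimal illustration under the assumption that the means and covariance matrices are supplied as NumPy arrays; the function name `linear_procedure` is hypothetical. It solves $(t_1\Sigma_1 + t_2\Sigma_2)b = \gamma$, obtains $c = t_2 b'\Sigma_2 b + b'\mu^{(2)}$, and evaluates the two arguments $y_1$ and $y_2$.

```python
import numpy as np

def linear_procedure(t1, t2, mu1, mu2, sigma1, sigma2):
    """Return b, c, y_1, y_2 for the linear rule b'x >= c with given t_1, t_2."""
    gamma = mu1 - mu2
    b = np.linalg.solve(t1 * sigma1 + t2 * sigma2, gamma)
    s1 = np.sqrt(b @ sigma1 @ b)            # (b' Sigma_1 b)^(1/2)
    s2 = np.sqrt(b @ sigma2 @ b)            # (b' Sigma_2 b)^(1/2)
    c = t2 * (b @ sigma2 @ b) + b @ mu2     # cutoff point
    y1 = (b @ mu1 - c) / s1                 # argument for P(2|1) = 1 - Phi(y_1)
    y2 = (c - b @ mu2) / s2                 # argument for P(1|2) = 1 - Phi(y_2)
    return b, c, y1, y2
```

The returned values satisfy $y_1 = t_1(b'\Sigma_1 b)^{1/2}$ and $y_2 = t_2(b'\Sigma_2 b)^{1/2}$, which gives a quick numerical check of the derivation.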


Minimization of One Probability of Misclassification for a Specified 
Probability of the Other 

Suppose we are given $y_2$ (or, equivalently, the probability of misclassification
when sampling from the second distribution) and we want to maximize $y_1$
(or, equivalently, minimize the probability of misclassification when sampling
from the first distribution). Suppose $y_2 > 0$ (i.e., the given probability of
misclassification is less than $\frac{1}{2}$). Then if the maximum $y_1 \ge 0$, we want to find
$t_2 = 1 - t_1$ such that $y_2 = t_2(b'\Sigma_2 b)^{1/2}$, where $b = [t_1\Sigma_1 + t_2\Sigma_2]^{-1}\gamma$. The solution
can be approximated by trial and error, since $y_2$ is an increasing function
of $t_2$. For $t_2 = 0$, $y_2 = 0$; and for $t_2 = 1$, $y_2 = (b'\Sigma_2 b)^{1/2} = (b'\gamma)^{1/2} = (\gamma'\Sigma_2^{-1}\gamma)^{1/2}$,
where $\Sigma_2 b = \gamma$. One could try other values of $t_2$ successively by solving (14)
and inserting in $b'\Sigma_2 b$ until $t_2(b'\Sigma_2 b)^{1/2}$ agreed closely enough with the
desired $y_2$. [$y_1 > 0$ if the specified $y_2 < (\gamma'\Sigma_2^{-1}\gamma)^{1/2}$.]

The Minimax Procedure 

The minimax procedure is the admissible procedure for which $y_1 = y_2$. Since
for this procedure both probabilities of correct classification are greater than
$\frac{1}{2}$, we have $y_1 = y_2 > 0$ and $t_1 > 0$, $t_2 > 0$. We want to find $t$ ($= t_1 = 1 - t_2$) so
that

(20)  $$0 = y_1^2 - y_2^2 = t^2 b'\Sigma_1 b - (1 - t)^2 b'\Sigma_2 b = b'[t^2\Sigma_1 - (1 - t)^2\Sigma_2]b.$$

Since $y_1^2$ increases with $t$ and $y_2^2$ decreases with increasing $t$, there is one and
only one solution to (20), and this can be approximated by trial and error by
guessing a value of $t$ ($0 < t < 1$), solving (14) for $b$, and computing the
quadratic form on the right of (20). Then another $t$ can be tried.
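The trial-and-error search for the minimax $t$ can be organized as a bisection, since $y_1$ increases and $y_2$ decreases with $t$. The sketch below substitutes bisection for unstructured guessing; that choice, and the names `y_pair` and `minimax_t`, are assumptions of this example rather than the procedure as originally programmed.

```python
import numpy as np

def y_pair(t, gamma, sigma1, sigma2):
    """Return y_1 = t (b'Sigma_1 b)^(1/2), y_2 = (1-t)(b'Sigma_2 b)^(1/2), and b."""
    b = np.linalg.solve(t * sigma1 + (1.0 - t) * sigma2, gamma)
    y1 = t * np.sqrt(b @ sigma1 @ b)
    y2 = (1.0 - t) * np.sqrt(b @ sigma2 @ b)
    return y1, y2, b

def minimax_t(gamma, sigma1, sigma2, tol=1e-10):
    """Bisection on t in (0, 1) for the unique point where y_1 = y_2."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        y1, y2, _ = y_pair(mid, gamma, sigma1, sigma2)
        if y1 < y2:
            lo = mid   # y_1 is still too small, so t must increase
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With $\Sigma_1 = I$, $\Sigma_2 = 4I$, and $\gamma = (1, 0)'$, the solution is $t = 2/3$ with common value $y_1 = y_2 = 1/3$.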

An alternative approach is to set $y_1 = y_2$ in (9) and solve for $c$. Then the
common value of $y_1 = y_2$ is

(21)  $$\frac{b'\gamma}{(b'\Sigma_1 b)^{1/2} + (b'\Sigma_2 b)^{1/2}},$$

and we want to find $b$ to maximize this, where $b$ is of the form

(22)  $$[t\Sigma_1 + (1 - t)\Sigma_2]^{-1}\gamma$$

with $0 < t < 1$.

When $\Sigma_1 = \Sigma_2 = \Sigma$, the maximum of (21) is $\frac{1}{2}(\gamma'\Sigma^{-1}\gamma)^{1/2}$, so twice the maximum
of (21) is the Mahalanobis distance between the populations. This suggests that when
$\Sigma_1$ may be unequal to $\Sigma_2$, twice the maximum of (21) might be called the distance
between the populations.

Welch and Wimpress (1961) have programmed the minimax procedure 
and applied it to the recognition of spoken sounds. 

Case of A Priori Probabilities 

Suppose we are given a priori probabilities, $q_1$ and $q_2$, of the first and second
populations, respectively. Then the probability of a misclassification is

(23)  $$q_1[1 - \Phi(y_1)] + q_2[1 - \Phi(y_2)] = 1 - [q_1\Phi(y_1) + q_2\Phi(y_2)],$$

which we want to minimize. The solution will be an admissible linear
procedure. If we know it involves $y_1 \ge 0$ and $y_2 \ge 0$, we can substitute
$y_1 = t(b'\Sigma_1 b)^{1/2}$ and $y_2 = (1 - t)(b'\Sigma_2 b)^{1/2}$, where $b = [t\Sigma_1 + (1 - t)\Sigma_2]^{-1}\gamma$,
into (23) and set the derivative of (23) with respect to $t$ equal to 0, obtaining

(24)  $$q_1\phi(y_1)\frac{dy_1}{dt} + q_2\phi(y_2)\frac{dy_2}{dt} = 0,$$

where $\phi(u) = (2\pi)^{-1/2}e^{-u^2/2}$. There does not seem to be any easy or direct
way of solving (24) for $t$. The left-hand side of (24) is not necessarily
monotonic. In fact, there may be several roots to (24). If there are, the
absolute minimum will be found by putting the solution into (23). (We
remind the reader that the curve of admissible error probabilities is not
necessarily convex.)
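Because (24) may have several roots, a direct numerical approach is to evaluate (23) along the admissible curve and take the smallest value. The following sketch uses a plain grid search over $0 \le t \le 1$, which assumes the solution has $y_1 \ge 0$ and $y_2 \ge 0$; the grid search and the names `Phi`, `error_probability`, and `best_t` are illustrative choices, not a method prescribed in the text.

```python
import math
import numpy as np

def Phi(u):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def error_probability(t, q1, q2, gamma, sigma1, sigma2):
    """Value of (23) for the admissible linear procedure indexed by t."""
    b = np.linalg.solve(t * sigma1 + (1.0 - t) * sigma2, gamma)
    y1 = t * math.sqrt(b @ sigma1 @ b)
    y2 = (1.0 - t) * math.sqrt(b @ sigma2 @ b)
    return q1 * (1.0 - Phi(y1)) + q2 * (1.0 - Phi(y2))

def best_t(q1, q2, gamma, sigma1, sigma2, grid=2001):
    """Grid search for the t minimizing (23); robust to several roots of (24)."""
    ts = np.linspace(0.0, 1.0, grid)
    errs = [error_probability(t, q1, q2, gamma, sigma1, sigma2) for t in ts]
    k = int(np.argmin(errs))
    return float(ts[k]), float(errs[k])
```

A finer grid, or a local root search started from the best grid point, would sharpen the answer if more accuracy is needed.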

Anderson and Bahadur (1962) studied these linear procedures in general,
including $y_1 < 0$ and $y_2 < 0$. Clunies-Ross and Riffenburgh (1960) approached
the problem from a more geometric point of view.


PROBLEMS 

6.1. (Sec. 6.3) Let $\pi_i$ be $N(\mu, \Sigma_i)$, $i = 1, 2$. Find the form of the admissible
classification procedures.

6.2. (Sec. 6.3) Prove that every complete class of procedures includes the class of
admissible procedures.

6.3. (Sec. 6.3) Prove that if the class of admissible procedures is complete, it is
minimal complete.

6.4. (Sec. 6.3) The Neyman-Pearson fundamental lemma states that of all tests at a
given significance level of the null hypothesis that $x$ is drawn from $p_1(x)$
against alternative that it is drawn from $p_2(x)$ the most powerful test has the
critical region $p_1(x)/p_2(x) < k$. Show that the discussion in Section 6.3 proves
this result.

6.5. (Sec. 6.3) When $p(x) = n(x \mid \mu, \Sigma)$ find the best test of $\mu = 0$ against $\mu = \mu^*$
at significance level $\varepsilon$. Show that this test is uniformly most powerful against all
alternatives $\mu = c\mu^*$, $c > 0$. Prove that there is no uniformly most powerful test
against $\mu = \mu^{(1)}$ and $\mu = \mu^{(2)}$ unless $\mu^{(1)} = c\mu^{(2)}$ for some $c > 0$.

6.6. (Sec. 6.4) Let $P(2|1)$ and $P(1|2)$ be defined by (14) and (15). Prove if
$-\frac{1}{2}\Delta^2 < c < \frac{1}{2}\Delta^2$, then $P(2|1)$ and $P(1|2)$ are decreasing functions of $\Delta$.

6.7. (Sec. 6.4) Let $x' = (x^{(1)\prime}, x^{(2)\prime})$. Using Problem 5.23 and Problem 6.6, prove
that the class of classification procedures based on $x$ is uniformly as good as
the class of procedures based on $x^{(1)}$.

6.8. (Sec. 6.5.1) Find the criterion for classifying irises as Iris setosa or Iris
versicolor on the basis of data given in Section 5.3.4. Classify a random sample
of 5 Iris virginica in Table 3.4.

6.9. (Sec. 6.5.1) Let $W(x)$ be the classification criterion given by (2). Show that the
$T^2$-criterion for testing $N(\mu^{(1)}, \Sigma) = N(\mu^{(2)}, \Sigma)$ is proportional to $W(\bar{x}^{(1)})$ and
$W(\bar{x}^{(2)})$.

6.10. (Sec. 6.5.1) Show that the probabilities of misclassification of $x_1, \ldots, x_N$ (all
assumed to be from either $\pi_1$ or $\pi_2$) decrease as $N$ increases.

6.11. (Sec. 6.5) Show that the elements of $M$ are invariant under the transformation
(34) and that any function of the sufficient statistics that is invariant is a
function of $M$.

6.12. (Sec. 6.5) Consider $d'x_\alpha^{(i)}$. Prove that the ratio

(U ⑴ - ⑽ A ) 2 

飞 ^ iv ： " 

£ f (d'x^-d'x^y 


6.13. (Sec. 6.6) Show that the derivative of (2) to terms of order $n^{-1}$ is


必 (士厶 V 


+ 


n 




6.14. (Sec. 6.6) Show $\mathcal{E}D^2$ is (4). [Hint: Let $\Sigma = I$ and show that $\mathcal{E}(S^{-1} \mid \Sigma = I) = [n/(n - p - 1)]I$.]


6.15. (Sec. 6.6.2) Show 


Pr< 


Z-\D 2 
~ D ~ 


<u 




Pr 


Z — y 


<u 


7T, 


必 ⑷ { 2N K 2 [“ 3 + (P - 3 )“ - 厶 〜 + P A 1 
+ 2~/^ ^ [“ 3 + 2A“ 2 + (p - 3 + Sr)u - 厶 3 + pA] 

+ 士 [3“ 3 + 4Aw 2 + (2p - 3 + A 2 )“ + 2(p - l)d] | + 0(n’ 2 ). 


6.16. (Sec. 6.8) Let $\pi_i$ be $N(\mu^{(i)}, \Sigma)$, $i = 1, \ldots, m$. If the $\mu^{(i)}$ are on a line (i.e.,
$\mu^{(i)} = \mu + \nu_i\beta$), show that for admissible procedures the $R_i$ are defined by
parallel planes. Thus show that only one discriminant function $u_{jk}(x)$ need be
used.

6.17. (Sec. 6.8) In Section 8.8 data are given on samples from four populations of
skulls. Consider the first two measurements and the first three samples.
Construct the classification functions $u_{ij}(x)$. Find the procedure for $q_i =
N_i/(N_1 + N_2 + N_3)$. Find the minimax procedure.

6.18. (Sec. 6.10) Show that $b'x = c$ is the equation of a plane that is tangent to an
ellipsoid of constant density of $\pi_1$ and to an ellipsoid of constant density of $\pi_2$
at a common point.

6.19. (Sec. 6.8) Let $x_1^{(j)}, \ldots, x_{N_j}^{(j)}$ be observations from $N(\mu^{(j)}, \Sigma)$, $j = 1, 2, 3$, and let
$x$ be an observation to be classified. Give explicitly the maximum likelihood
rule.

6.20. (Sec. 6.5) Verify (33).



CHAPTER 7 


The Distribution of the Sample 
Covariance Matrix and the 
Sample Generalized Variance 


7.1. INTRODUCTION

The sample covariance matrix, $S = [1/(N - 1)]\sum_{\alpha=1}^{N}(x_\alpha - \bar{x})(x_\alpha - \bar{x})'$, is an
unbiased estimator of the population covariance matrix $\Sigma$. In Section 4.2 we
found the density of $A = (N - 1)S$ in the case of a $2 \times 2$ matrix. In Section
7.2 this result will be generalized to the case of a matrix $A$ of any order.
When $\Sigma = I$, this distribution is in a sense a generalization of the $\chi^2$-distribution.
The distribution of $A$ (or $S$), often called the Wishart distribution, is
fundamental to multivariate statistical analysis. In Sections 7.3 and 7.4 we
discuss some properties of the Wishart distribution.

The generalized variance of the sample is defined as $|S|$ in Section 7.5; it
is a measure of the scatter of the sample. Its distribution is characterized.
The density of the set of all correlation coefficients when the components of
the observed vector are independent is obtained in Section 7.6.

The inverted Wishart distribution is introduced in Section 7.7 and is used
as an a priori distribution of $\Sigma$ to obtain a Bayes estimator of the covariance
matrix. In Section 7.8 we consider improving on $S$ as an estimator of $\Sigma$ with
respect to two loss functions. Section 7.9 treats the distributions for sampling
from elliptically contoured distributions.


An Introduction to Multivariate Statistical Analysis, Third Edition. By T. W. Anderson
ISBN 0-471-36091-0 Copyright © 2003 John Wiley & Sons, Inc.


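7.2. THE WISHART DISTRIBUTION

We shall obtain the distribution of $A = \sum_{\alpha=1}^{N}(X_\alpha - \bar{X})(X_\alpha - \bar{X})'$, where
$X_1, \ldots, X_N$ ($N > p$) are independent, each with the distribution $N(\mu, \Sigma)$. As
was shown in Section 3.3, $A$ is distributed as $\sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$, where $n = N - 1$
and $Z_1, \ldots, Z_n$ are independent, each with the distribution $N(0, \Sigma)$. We shall
show that the density of $A$ for $A$ positive definite is

(1)  $$\frac{|A|^{\frac{1}{2}(n-p-1)}\exp(-\frac{1}{2}\operatorname{tr}\Sigma^{-1}A)}{2^{\frac{1}{2}np}\,\pi^{p(p-1)/4}\,|\Sigma|^{\frac{1}{2}n}\prod_{i=1}^{p}\Gamma[\frac{1}{2}(n + 1 - i)]}.$$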


We shall first consider the case of $\Sigma = I$. Let

(2)  $$(Z_1, \ldots, Z_n) = \begin{pmatrix} v_1' \\ \vdots \\ v_p' \end{pmatrix}.$$

Then the elements of $A = (a_{ij})$ are inner products of these $n$-component
vectors, $a_{ij} = v_i'v_j$. The vectors $v_1, \ldots, v_p$ are independently distributed, each
according to $N(0, I_n)$. It will be convenient to transform to new coordinates
according to the Gram-Schmidt orthogonalization. Let $w_1 = v_1$,

(3)  $$w_i = v_i - \sum_{j=1}^{i-1}\frac{v_i'w_j}{w_j'w_j}\,w_j, \qquad i = 2, \ldots, p.$$

We prove by induction that $w_k$ is orthogonal to $w_i$, $k < i$. Assume $w_k'w_h = 0$,
$k \ne h$, $k, h = 1, \ldots, i - 1$; then take the inner product of $w_k$ and (3) to obtain
$w_k'w_i = 0$, $k = 1, \ldots, i - 1$. (Note that $\Pr\{\|w_i\| = 0\} = 0$.)

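Define $t_{ii} = \|w_i\| = \sqrt{w_i'w_i}$, $i = 1, \ldots, p$, and $t_{ij} = v_i'w_j/\|w_j\|$, $j = 1, \ldots, i - 1$,
$i = 2, \ldots, p$. Since $v_i = \sum_{j=1}^{i}(t_{ij}/\|w_j\|)w_j$,

(4)  $$a_{hi} = v_h'v_i = \sum_{j=1}^{\min(h,i)} t_{hj}t_{ij}.$$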


If we define the lower triangular matrix $T = (t_{ij})$ with $t_{ii} > 0$, $i = 1, \ldots, p$, and
$t_{ij} = 0$, $i < j$, then

(5)  $$A = TT'.$$
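The construction of $T$ by Gram-Schmidt orthogonalization can be checked numerically. The sketch below is a minimal illustration; the function name `rectangular_coordinates` is hypothetical, and the rows of the input array `V` play the role of $v_1, \ldots, v_p$. It follows (3) and the definitions of $t_{ij}$ and $t_{ii}$ literally.

```python
import numpy as np

def rectangular_coordinates(V):
    """Gram-Schmidt on the rows v_1..v_p of V (p x n); returns T with A = TT'."""
    p, n = V.shape
    W = np.zeros((p, n))
    T = np.zeros((p, p))
    for i in range(p):
        w = V[i].copy()
        for j in range(i):
            T[i, j] = V[i] @ W[j] / np.linalg.norm(W[j])   # t_ij = v_i'w_j / ||w_j||
            w -= (V[i] @ W[j]) / (W[j] @ W[j]) * W[j]       # subtract projection, as in (3)
        W[i] = w
        T[i, i] = np.linalg.norm(w)                         # t_ii = ||w_i||
    return T
```

Since a positive definite matrix has exactly one lower triangular factor with positive diagonal, the result should agree with `np.linalg.cholesky(V @ V.T)` up to rounding.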


Note that $t_{ij}$, $j = 1, \ldots, i - 1$, are the first $i - 1$ coordinates of $v_i$ in the
coordinate system with $w_1, \ldots, w_{i-1}$ as the first $i - 1$ coordinate axes. (See
Figure 7.1.) The sum of the other $n - i + 1$ coordinates squared is $\|v_i\|^2 - \sum_{j=1}^{i-1}t_{ij}^2
= t_{ii}^2 = \|w_i\|^2$; $w_i$ is the vector from $v_i$ to its projection on $w_1, \ldots, w_{i-1}$
(or equivalently on $v_1, \ldots, v_{i-1}$).

Figure 7.1. Transformation of coordinates.


Lemma 7.2.1. Conditional on $w_1, \ldots, w_{i-1}$ (or equivalently on $v_1, \ldots, v_{i-1}$),
$t_{i1}, \ldots, t_{i,i-1}$ and $t_{ii}^2$ are independently distributed; $t_{ij}$ is distributed according to
$N(0, 1)$, $i > j$; and $t_{ii}^2$ has the $\chi^2$-distribution with $n - i + 1$ degrees of freedom.

Proof. The coordinates of $v_i$ referred to the new orthogonal coordinates
with $v_1, \ldots, v_{i-1}$ defining the first coordinate axes are independently normally
distributed with means 0 and variances 1 (Theorem 3.3.1). $t_{ii}^2$ is the sum
of the coordinates squared omitting the first $i - 1$. ∎


Since the conditional distribution of $t_{i1}, \ldots, t_{ii}$ does not depend on
$v_1, \ldots, v_{i-1}$, they are distributed independently of $t_{11}, t_{21}, t_{22}, \ldots, t_{i-1,i-1}$.

Corollary 7.2.1. Let $Z_1, \ldots, Z_n$ ($n \ge p$) be independently distributed, each
according to $N(0, I)$; let $A = \sum_{\alpha=1}^{n} Z_\alpha Z_\alpha' = TT'$, where $t_{ij} = 0$, $i < j$, and $t_{ii} > 0$,
$i = 1, \ldots, p$. Then $t_{11}, t_{21}, \ldots, t_{pp}$ are independently distributed; $t_{ij}$ is distributed
according to $N(0, 1)$, $i > j$; and $t_{ii}^2$ has the $\chi^2$-distribution with $n - i + 1$ degrees
of freedom.
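Corollary 7.2.1 gives a direct way to simulate $A$ when $\Sigma = I$ without generating the $n$ vectors $Z_\alpha$. The sketch below is an illustration under that assumption; `sample_T` is a hypothetical helper name, and the random generator is NumPy's `Generator`.

```python
import numpy as np

def sample_T(n, p, rng):
    """Draw the lower triangular T of Corollary 7.2.1 (case Sigma = I)."""
    T = np.tril(rng.standard_normal((p, p)), k=-1)        # t_ij ~ N(0, 1) for i > j
    dof = n - np.arange(p)                                 # n - i + 1 for i = 1, ..., p
    T[np.diag_indices(p)] = np.sqrt(rng.chisquare(dof))   # t_ii^2 ~ chi-square
    return T
```

Averaging `T @ T.T` over many draws should give a matrix close to $nI$, since each $Z_\alpha Z_\alpha'$ has expectation $I$.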


Since $t_{ii}$ has density $2^{-\frac{1}{2}(n-i-1)}t^{n-i}e^{-\frac{1}{2}t^2}/\Gamma[\frac{1}{2}(n + 1 - i)]$, the joint density
of $t_{ij}$, $j = 1, \ldots, i$, $i = 1, \ldots, p$, is

(6)  $$\frac{\prod_{i=1}^{p} t_{ii}^{n-i}\exp(-\frac{1}{2}\sum_{i=1}^{p}\sum_{j=1}^{i} t_{ij}^2)}{\pi^{p(p-1)/4}\,2^{\frac{1}{2}pn - p}\prod_{i=1}^{p}\Gamma[\frac{1}{2}(n + 1 - i)]}.$$

Let $C$ be a lower triangular matrix ($c_{ij} = 0$, $i < j$) such that $\Sigma = CC'$ and
$c_{ii} > 0$. The linear transformation $T^* = CT$, that is,

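(7)  $$t^*_{ij} = \sum_{k=j}^{i} c_{ik}t_{kj}, \quad i \ge j; \qquad t^*_{ij} = 0, \quad i < j,$$

can be written

(8)  $$\begin{pmatrix} t^*_{11} \\ t^*_{21} \\ t^*_{22} \\ \vdots \\ t^*_{p1} \\ \vdots \\ t^*_{pp} \end{pmatrix} = \begin{pmatrix} c_{11} & 0 & 0 & \cdots & 0 & \cdots & 0 \\ x & c_{22} & 0 & \cdots & 0 & \cdots & 0 \\ x & x & c_{22} & \cdots & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots & & \vdots \\ x & x & x & \cdots & c_{pp} & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots & & \vdots \\ x & x & x & \cdots & x & \cdots & c_{pp} \end{pmatrix} \begin{pmatrix} t_{11} \\ t_{21} \\ t_{22} \\ \vdots \\ t_{p1} \\ \vdots \\ t_{pp} \end{pmatrix},$$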

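where $x$ denotes an element, possibly nonzero. Since the matrix of the
transformation is triangular, its determinant is the product of the diagonal
elements, namely, $\prod_{i=1}^{p} c_{ii}^{i}$. The Jacobian of the transformation from $T$ to
$T^*$ is the reciprocal of the determinant. The density of $T^*$ is obtained by
substituting into (6) $t_{ii} = t^*_{ii}/c_{ii}$ and

(9)  $$\sum_{i=1}^{p}\sum_{j=1}^{i} t_{ij}^2 = \operatorname{tr} TT' = \operatorname{tr} T^*T^{*\prime}(C')^{-1}C^{-1} = \operatorname{tr} T^*T^{*\prime}\Sigma^{-1},$$

and using $\prod_{i=1}^{p} c_{ii}^2 = |C||C'| = |\Sigma|$.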

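Theorem 7.2.1. Let $Z_1, \ldots, Z_n$ ($n \ge p$) be independently distributed, each
according to $N(0, \Sigma)$; let $A = \sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$; and let $A = T^*T^{*\prime}$, where $t^*_{ij} = 0$,
$i < j$, and $t^*_{ii} > 0$. Then the density of $T^*$ is

(10)  $$\frac{\prod_{i=1}^{p} t_{ii}^{*\,n-i}\exp(-\frac{1}{2}\operatorname{tr}\Sigma^{-1}T^*T^{*\prime})}{2^{\frac{1}{2}p(n-2)}\,\pi^{p(p-1)/4}\,|\Sigma|^{\frac{1}{2}n}\prod_{i=1}^{p}\Gamma[\frac{1}{2}(n + 1 - i)]}.$$

We can write (4) as $a_{hi} = \sum_{j=1}^{i} t^*_{hj}t^*_{ij}$ for $h \ge i$. Then

(11)  $$\frac{\partial a_{hi}}{\partial t^*_{kl}} = 0, \quad k > h; \qquad \frac{\partial a_{hi}}{\partial t^*_{kl}} = 0, \quad k = h,\ l > i;$$

that is, $\partial a_{hi}/\partial t^*_{kl} = 0$ if $k, l$ is beyond $h, i$ in the lexicographic ordering. The
Jacobian of the transformation from $A$ to $T^*$ is the determinant of the lower
triangular matrix with diagonal elements

(12)  $$\frac{\partial a_{hh}}{\partial t^*_{hh}} = 2t^*_{hh},$$

(13)  $$\frac{\partial a_{hi}}{\partial t^*_{hi}} = t^*_{ii}, \qquad h > i.$$

The Jacobian is therefore $2^p\prod_{i=1}^{p} t_{ii}^{*\,p+1-i}$. The Jacobian of the transformation
from $T^*$ to $A$ is the reciprocal.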

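Theorem 7.2.2. Let $Z_1, \ldots, Z_n$ be independently distributed, each according
to $N(0, \Sigma)$. The density of $A = \sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$ is

(14)  $$\frac{|A|^{\frac{1}{2}(n-p-1)}e^{-\frac{1}{2}\operatorname{tr}\Sigma^{-1}A}}{2^{\frac{1}{2}pn}\,\pi^{p(p-1)/4}\,|\Sigma|^{\frac{1}{2}n}\prod_{i=1}^{p}\Gamma[\frac{1}{2}(n + 1 - i)]}$$

for $A$ positive definite, and 0 otherwise.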

Corollary 7.2.2. Let $X_1, \ldots, X_N$ ($N > p$) be independently distributed, each
according to $N(\mu, \Sigma)$. Then the density of $A = \sum_{\alpha=1}^{N}(X_\alpha - \bar{X})(X_\alpha - \bar{X})'$ is (14)
for $n = N - 1$.

The density (14) will be denoted by $w(A \mid \Sigma, n)$, and the associated distribution
will be termed $W(\Sigma, n)$. If $n < p$, then $A$ does not have a density, but
its distribution is nevertheless defined, and we shall refer to it as $W(\Sigma, n)$.

Corollary 7.2.3. Let $X_1, \ldots, X_N$ ($N > p$) be independently distributed, each
according to $N(\mu, \Sigma)$. The distribution of $S = (1/n)\sum_{\alpha=1}^{N}(X_\alpha - \bar{X})(X_\alpha - \bar{X})'$ is
$W[(1/n)\Sigma, n]$, where $n = N - 1$.

Proof. $S$ has the distribution of $\sum_{\alpha=1}^{n}[(1/\sqrt{n})Z_\alpha][(1/\sqrt{n})Z_\alpha]'$, where
$(1/\sqrt{n})Z_1, \ldots, (1/\sqrt{n})Z_n$ are independently distributed, each according to
$N(0, (1/n)\Sigma)$. Theorem 7.2.2 implies this corollary. ∎

The Wishart distribution for $p = 2$ as given in Section 4.2.1 was derived by
Fisher (1915). The distribution for arbitrary $p$ was obtained by Wishart
(1928) by a geometric argument using $v_1, \ldots, v_p$ defined above. As noted in
Section 3.2, the $i$th diagonal element of $A$ is the squared length of the $i$th
vector, $a_{ii} = v_i'v_i = \|v_i\|^2$, and the $i, j$th off-diagonal element of $A$ is the product
of the lengths of $v_i$ and $v_j$ and the cosine of the angle between them. The
matrix $A$ specifies the lengths and configuration of the vectors.

We shall give a geometric interpretation† of the derivation of the density
of the rectangular coordinates $t_{ij}$, $i \ge j$, when $\Sigma = I$. The probability element
of $t_{11}$ is approximately the probability that $\|v_1\|$ lies in the interval $t_{11} < \|v_1\|
< t_{11} + dt_{11}$. This is the probability that $v_1$ falls in a spherical shell in $n$
dimensions with inner radius $t_{11}$ and thickness $dt_{11}$. In this region, the density
$(2\pi)^{-\frac{1}{2}n}\exp(-\frac{1}{2}v_1'v_1)$ is approximately constant, namely, $(2\pi)^{-\frac{1}{2}n}\exp(-\frac{1}{2}t_{11}^2)$.

The surface area of the unit sphere in $n$ dimensions is $C(n) = 2\pi^{\frac{1}{2}n}/\Gamma(\frac{1}{2}n)$
(Problems 7.1-7.3), and the volume of the spherical shell is approximately
$C(n)t_{11}^{n-1}\,dt_{11}$. The probability element is the product of the volume and
approximate density, namely,

(15)  $$2^{-(\frac{1}{2}n - 1)}\,t_{11}^{n-1}\exp(-\tfrac{1}{2}t_{11}^2)\,dt_{11}/\Gamma(\tfrac{1}{2}n).$$

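The probability element of $t_{i1}, \ldots, t_{ii}$ given $v_1, \ldots, v_{i-1}$ (i.e., given
$w_1, \ldots, w_{i-1}$) is approximately the probability that $v_i$ falls in the region for
which $t_{i1} < v_i'w_1/\|w_1\| < t_{i1} + dt_{i1}, \ldots, t_{i,i-1} < v_i'w_{i-1}/\|w_{i-1}\| < t_{i,i-1} + dt_{i,i-1}$,
and $t_{ii} < \|w_i\| < t_{ii} + dt_{ii}$, where $w_i$ is the projection of $v_i$ on the $(n - i + 1)$-
dimensional space orthogonal to $w_1, \ldots, w_{i-1}$. Each of the first $i - 1$ pairs of
inequalities defines the region between two hyperplanes (the different pairs
being orthogonal). The last pair of inequalities defines a cylindrical shell
whose intersection with the $(i - 1)$-dimensional hyperplane spanned by
$v_1, \ldots, v_{i-1}$ is a spherical shell in $n - i + 1$ dimensions with inner radius $t_{ii}$.
In this region the density $(2\pi)^{-\frac{1}{2}n}\exp(-\frac{1}{2}v_i'v_i)$ is approximately constant,
namely, $(2\pi)^{-\frac{1}{2}n}\exp(-\frac{1}{2}\sum_{j=1}^{i} t_{ij}^2)$. The volume of the region is approximately
$dt_{i1}\cdots dt_{i,i-1}\,C(n - i + 1)\,t_{ii}^{n-i}\,dt_{ii}$. The probability element is

(16)  $$\frac{2^{-(\frac{1}{2}n - 1)}\,\pi^{-\frac{1}{2}(i-1)}\,t_{ii}^{n-i}\exp(-\frac{1}{2}\sum_{j=1}^{i} t_{ij}^2)}{\Gamma[\frac{1}{2}(n + 1 - i)]}\,dt_{i1}\cdots dt_{ii}.$$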

Then the product of (15) and (16) for $i = 2, \ldots, p$ is (6) times $dt_{11}\cdots dt_{pp}$.
This analysis, which exactly parallels the geometric derivation by Wishart
[and later Mahalanobis, Bose, and Roy (1937)], was given by Sverdrup
(1947) [and by Fog (1948) for $p = 3$]. Another method was used by Madow
(1938), who drew on the distribution of correlation coefficients (for $\Sigma = I$)
obtained by Hotelling by considering certain partial correlation coefficients.
Hsu (1939b) gave an inductive proof, and Rasch (1948) gave a method
involving the use of a functional equation. A different method is to obtain the
characteristic function and invert it, as was done by Ingham (1933) and by
Wishart and Bartlett (1933).

† In the first edition of this book, the derivation of the Wishart distribution and its geometric
interpretation were in terms of the nonorthogonal vectors $v_1, \ldots, v_p$.

Cramér (1946) verified that the Wishart distribution has the characteristic
function of $A$. By means of alternative matrix transformations Elfving (1947),
Mauldon (1955), and Olkin and Roy (1954) derived the Wishart distribution
via the Bartlett decomposition; Kshirsagar (1959) based his derivation on
random orthogonal transformations. Narain (1948), (1950) and Ogawa (1953)
used a regression approach. James (1954), Khatri and Ramachandran (1958),
and Khatri (1963) applied different methods. Giri (1977) used invariance.
Wishart (1948) surveyed the derivations up to that date. Some of these
methods are indicated in the problems.

The relation $A = TT'$ is known as the Bartlett decomposition [Bartlett
(1939)], and the (nonzero) elements of $T$ were termed rectangular coordinates
by Mahalanobis, Bose, and Roy (1937).

Corollary 7.2.4

(17)  $$\int\cdots\int_{B > 0} |B|^{t - \frac{1}{2}(p+1)}e^{-\operatorname{tr}B}\,dB = \pi^{p(p-1)/4}\prod_{i=1}^{p}\Gamma[t - \tfrac{1}{2}(i - 1)].$$

Proof. Here $B > 0$ denotes $B$ positive definite. Since (14) is a density, its
integral for $A > 0$ is 1. Let $\Sigma = I$, $A = 2B$ ($dA = 2^{\frac{1}{2}p(p+1)}\,dB$), and $n = 2t$. Then the
fact that the integral is 1 is identical to (17) for $t$ a half integer. However, if
we derive (14) from (6), we can let $n$ be any real number greater than $p - 1$.
In fact (17) holds for complex $t$ such that $\mathcal{R}t > p - 1$. ($\mathcal{R}t$ means the real
part of $t$.) ∎

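Definition 7.2.1. The multivariate gamma function is

(18)  $$\Gamma_p(t) = \pi^{p(p-1)/4}\prod_{i=1}^{p}\Gamma[t - \tfrac{1}{2}(i - 1)].$$

The Wishart density can be written

(19)  $$w(A \mid \Sigma, n) = \frac{|A|^{\frac{1}{2}(n-p-1)}e^{-\frac{1}{2}\operatorname{tr}\Sigma^{-1}A}}{2^{\frac{1}{2}pn}\,|\Sigma|^{\frac{1}{2}n}\,\Gamma_p(\frac{1}{2}n)}.$$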
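For numerical work the Wishart density is usually evaluated on the logarithmic scale. The sketch below is a minimal illustration of the density written with the multivariate gamma function; the helper names `log_multivariate_gamma` and `wishart_logpdf` are assumptions of this example, and $A$ is taken to be positive definite with $n \ge p$.

```python
import math
import numpy as np

def log_multivariate_gamma(t, p):
    """log of pi^{p(p-1)/4} * prod_{i=1}^p Gamma[t - (i - 1)/2]."""
    return 0.25 * p * (p - 1) * math.log(math.pi) + sum(
        math.lgamma(t - 0.5 * (i - 1)) for i in range(1, p + 1))

def wishart_logpdf(A, sigma, n):
    """Logarithm of w(A | Sigma, n) at a positive definite A (n >= p)."""
    A = np.atleast_2d(A).astype(float)
    sigma = np.atleast_2d(sigma).astype(float)
    p = A.shape[0]
    _, logdet_A = np.linalg.slogdet(A)
    _, logdet_S = np.linalg.slogdet(sigma)
    trace_term = np.trace(np.linalg.solve(sigma, A))      # tr Sigma^{-1} A
    return (0.5 * (n - p - 1) * logdet_A - 0.5 * trace_term
            - 0.5 * n * p * math.log(2.0) - 0.5 * n * logdet_S
            - log_multivariate_gamma(0.5 * n, p))
```

For $p = 1$ and $\Sigma = 1$ the density reduces to that of $\chi^2$ with $n$ degrees of freedom, which gives a simple check: with $n = 2$ the density at $A = 2$ is $\frac{1}{2}e^{-1}$.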





7.3. SOME PROPERTIES OF THE WISHART DISTRIBUTION 


7.3.1. The Characteristic Function 


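The characteristic function of the Wishart distribution can be obtained
directly from the distribution of the observations. Suppose $Z_1, \ldots, Z_n$ are
distributed independently, each with density

(1)  $$\frac{1}{(2\pi)^{\frac{1}{2}p}|\Sigma|^{\frac{1}{2}}}\exp(-\tfrac{1}{2}z'\Sigma^{-1}z).$$

Let

(2)  $$A = \sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'.$$

Introduce the $p \times p$ matrix $\Theta = (\theta_{ij})$ with $\theta_{ij} = \theta_{ji}$. The characteristic function
of $A_{11}, A_{22}, \ldots, A_{pp}, 2A_{12}, 2A_{13}, \ldots, 2A_{p-1,p}$ is

(3)  $$\mathcal{E}\exp[i\operatorname{tr}(A\Theta)] = \mathcal{E}\exp\Bigl(i\operatorname{tr}\sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'\Theta\Bigr) = \mathcal{E}\exp\Bigl(i\operatorname{tr}\sum_{\alpha=1}^{n} Z_\alpha'\Theta Z_\alpha\Bigr) = \mathcal{E}\exp\Bigl(i\sum_{\alpha=1}^{n} Z_\alpha'\Theta Z_\alpha\Bigr).$$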

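It follows from Lemma 2.6.1 that

(4)  $$\mathcal{E}\exp\Bigl(i\sum_{\alpha=1}^{n} Z_\alpha'\Theta Z_\alpha\Bigr) = \prod_{\alpha=1}^{n}\mathcal{E}\exp(iZ_\alpha'\Theta Z_\alpha) = [\mathcal{E}\exp(iZ'\Theta Z)]^n,$$

where $Z$ has the density (1). For $\Theta$ real, there is a real nonsingular matrix $B$
such that

(5)  $$B'\Sigma^{-1}B = I,$$

(6)  $$B'\Theta B = D,$$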

where $D$ is a real diagonal matrix (Theorem A.1.2 of the Appendix). If we let
$z = By$, then

(7)  $$\mathcal{E}\exp(iZ'\Theta Z) = \mathcal{E}\exp(iY'DY) = \mathcal{E}\prod_{j=1}^{p}\exp(id_{jj}Y_j^2) = \prod_{j=1}^{p}\mathcal{E}\exp(id_{jj}Y_j^2)$$

by Lemma 2.6.2. The $j$th factor in the second product is $\mathcal{E}\exp(id_{jj}Y_j^2)$,
where $Y_j$ has the distribution $N(0, 1)$; this is the characteristic function of the
$\chi^2$-distribution with one degree of freedom, namely $(1 - 2id_{jj})^{-\frac{1}{2}}$ [as can be
proved by expanding $\exp(id_{jj}y_j^2)$ in a power series and integrating term by
term]. Thus

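(8)  $$\mathcal{E}\exp(iZ'\Theta Z) = \prod_{j=1}^{p}(1 - 2id_{jj})^{-\frac{1}{2}} = |I - 2iD|^{-\frac{1}{2}}$$

since $I - 2iD$ is a diagonal matrix. From (5) and (6) we see that

(9)  $$|I - 2iD| = |B'\Sigma^{-1}B - 2iB'\Theta B| = |B'(\Sigma^{-1} - 2i\Theta)B| = |B'|\cdot|\Sigma^{-1} - 2i\Theta|\cdot|B| = |B|^2\cdot|\Sigma^{-1} - 2i\Theta|,$$

$|B'|\cdot|\Sigma^{-1}|\cdot|B| = |I| = 1$, and $|B|^2 = 1/|\Sigma^{-1}|$. Combining the above results,
we obtain

(10)  $$\mathcal{E}\exp[i\operatorname{tr}(A\Theta)] = \frac{|\Sigma^{-1}|^{\frac{1}{2}n}}{|\Sigma^{-1} - 2i\Theta|^{\frac{1}{2}n}} = |I - 2i\Theta\Sigma|^{-\frac{1}{2}n}.$$

It can be shown that the result is valid provided $(\mathcal{R}(\sigma^{jk} - 2i\theta_{jk}))$ is positive
definite. In particular, it is true for all real $\Theta$. It also holds for $\Sigma$ singular.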
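The two forms of the characteristic function can be compared numerically. The sketch below is an illustration only: the function names are hypothetical, and the fractional powers are taken on the principal branch, which is an assumption that is harmless for even $n$ (integer powers) but may need care for odd $n$ and large $\Theta$.

```python
import numpy as np

def wishart_cf(theta, sigma, n):
    """Characteristic function in the form |I - 2i Theta Sigma|^(-n/2)."""
    p = sigma.shape[0]
    return np.linalg.det(np.eye(p) - 2j * theta @ sigma) ** (-0.5 * n)

def wishart_cf_alt(theta, sigma, n):
    """Equivalent form |Sigma^{-1}|^(n/2) / |Sigma^{-1} - 2i Theta|^(n/2)."""
    sigma_inv = np.linalg.inv(sigma)
    return (np.linalg.det(sigma_inv) ** (0.5 * n)
            / np.linalg.det(sigma_inv - 2j * theta) ** (0.5 * n))
```

A Monte Carlo average of $\exp[i\operatorname{tr}(A\Theta)]$ over simulated $A = \sum_\alpha Z_\alpha Z_\alpha'$ should agree with either form up to sampling error.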

Theorem 7.3.1. If $Z_1, \ldots, Z_n$ are independent, each with distribution
$N(0, \Sigma)$, then the characteristic function of $A_{11}, \ldots, A_{pp}, 2A_{12}, \ldots, 2A_{p-1,p}$,
where $(A_{ij}) = A = \sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$, is given by (10).

7.3.2. The Sum of Wishart Matrices 

Suppose the $A_i$, $i = 1, 2$, are distributed independently according to $W(\Sigma, n_i)$,
respectively. Then $A_1$ is distributed as $\sum_{\alpha=1}^{n_1} Z_\alpha Z_\alpha'$, and $A_2$ is distributed as
$\sum_{\alpha=n_1+1}^{n_1+n_2} Z_\alpha Z_\alpha'$, where $Z_1, \ldots, Z_{n_1+n_2}$ are independent, each with distribution
$N(0, \Sigma)$. Then $A = A_1 + A_2$ is distributed as $\sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$, where $n = n_1 + n_2$.
Thus $A$ is distributed according to $W(\Sigma, n)$. Similarly, the sum of $q$ matrices
distributed independently, each according to a Wishart distribution with
covariance $\Sigma$, has a Wishart distribution with covariance matrix $\Sigma$ and
number of degrees of freedom equal to the sum of the numbers of degrees of
freedom of the component matrices.

Theorem 7.3.2. If $A_1, \ldots, A_q$ are independently distributed with $A_i$ distributed
according to $W(\Sigma, n_i)$, then

(11)  $$A = \sum_{i=1}^{q} A_i$$

is distributed according to $W(\Sigma, \sum_{i=1}^{q} n_i)$.


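7.3.3. A Certain Linear Transformation

We shall frequently make the transformation

(12)  $$A = CBC',$$

where $C$ is a nonsingular $p \times p$ matrix. If $A$ is distributed according to
$W(\Sigma, n)$, then $B$ is distributed according to $W(\Phi, n)$ where

(13)  $$\Phi = C^{-1}\Sigma C'^{-1}.$$

This is proved by the following argument: Let $A = \sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$, where
$Z_1, \ldots, Z_n$ are independently distributed, each according to $N(0, \Sigma)$. Then
$Y_\alpha = C^{-1}Z_\alpha$ is distributed according to $N(0, \Phi)$. However,

(14)  $$B = \sum_{\alpha=1}^{n} Y_\alpha Y_\alpha' = C^{-1}\sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'\,C'^{-1} = C^{-1}AC'^{-1}$$

is distributed according to $W(\Phi, n)$. Finally, $|\partial(A)/\partial(B)|$, the Jacobian of
the transformation (12), is

(15)  $$\Bigl|\frac{\partial(A)}{\partial(B)}\Bigr| = \frac{w(B \mid \Phi, n)}{w(A \mid \Sigma, n)} = \frac{|B|^{\frac{1}{2}(n-p-1)}|\Sigma|^{\frac{1}{2}n}}{|A|^{\frac{1}{2}(n-p-1)}|\Phi|^{\frac{1}{2}n}} = \operatorname{mod}|C|^{p+1}.$$

Theorem 7.3.3. The Jacobian of the transformation (12) from $A$ to $B$, where
$A$ and $B$ are symmetric, is $\operatorname{mod}|C|^{p+1}$.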
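The Jacobian in Theorem 7.3.3 can be verified numerically by writing the map $B \to CBC'$ as a linear transformation on the $\frac{1}{2}p(p+1)$ distinct elements of a symmetric matrix. The sketch below is an illustration; the function name `jacobian_det` and the choice of the lower triangle as coordinates are assumptions of the example.

```python
import numpy as np

def jacobian_det(C):
    """Absolute determinant of B -> CBC' acting on the distinct elements of B."""
    p = C.shape[0]
    idx = [(i, j) for i in range(p) for j in range(i + 1)]   # lower-triangle coordinates
    J = np.zeros((len(idx), len(idx)))
    for col, (k, l) in enumerate(idx):
        E = np.zeros((p, p))
        E[k, l] = E[l, k] = 1.0          # unit change in b_kl = b_lk
        dA = C @ E @ C.T                  # resulting change in A
        J[:, col] = [dA[i, j] for (i, j) in idx]
    return abs(np.linalg.det(J))
```

For any nonsingular `C`, the returned value should equal `abs(np.linalg.det(C)) ** (p + 1)` up to rounding.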


7.3.4. Marginal Distributions

If $A$ is distributed according to $W(\Sigma, n)$, the marginal distribution of any
arbitrary set of the elements of $A$ may be awkward to obtain. However, the
marginal distribution of some sets of elements can be found easily. We give
some of these in the following two theorems.





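Theorem 7.3.4. Let $A$ and $\Sigma$ be partitioned into $q$ and $p - q$ rows and
columns,

(16)    $A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$

If $A$ is distributed according to $W(\Sigma, n)$, then $A_{11}$ is distributed according to
$W(\Sigma_{11}, n)$.

Proof. $A$ is distributed as $\sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$, where the $Z_\alpha$ are independent, each
with the distribution $N(0, \Sigma)$. Partition $Z_\alpha$ into subvectors of $q$ and $p - q$
components, $Z_\alpha = (Z_\alpha^{(1)\prime}, Z_\alpha^{(2)\prime})'$. Then $Z_1^{(1)}, \ldots, Z_n^{(1)}$ are independent, each
with the distribution $N(0, \Sigma_{11})$, and $A_{11}$ is distributed as $\sum_{\alpha=1}^{n} Z_\alpha^{(1)} Z_\alpha^{(1)\prime}$, which
has the distribution $W(\Sigma_{11}, n)$. ■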

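Theorem 7.3.5. Let $A$ and $\Sigma$ be partitioned into $p_1, p_2, \ldots, p_q$ rows and
columns ($p_1 + \cdots + p_q = p$),

(17)    $A = \begin{pmatrix} A_{11} & \cdots & A_{1q} \\ \vdots & & \vdots \\ A_{q1} & \cdots & A_{qq} \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \cdots & \Sigma_{1q} \\ \vdots & & \vdots \\ \Sigma_{q1} & \cdots & \Sigma_{qq} \end{pmatrix}.$

If $\Sigma_{ij} = 0$ for $i \ne j$ and if $A$ is distributed according to $W(\Sigma, n)$, then
$A_{11}, A_{22}, \ldots, A_{qq}$ are independently distributed and $A_{jj}$ is distributed according to
$W(\Sigma_{jj}, n)$.

Proof. $A$ is distributed as $\sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$, where $Z_1, \ldots, Z_n$ are independently
distributed, each according to $N(0, \Sigma)$. Let $Z_\alpha$ be partitioned

(18)    $Z_\alpha = \begin{pmatrix} Z_\alpha^{(1)} \\ \vdots \\ Z_\alpha^{(q)} \end{pmatrix}$

as $A$ and $\Sigma$ have been partitioned. Since $\Sigma_{ij} = 0$, the sets $Z_1^{(1)}, \ldots, Z_n^{(1)}$, $\ldots$,
$Z_1^{(q)}, \ldots, Z_n^{(q)}$ are independent. Then $A_{11} = \sum_{\alpha=1}^{n} Z_\alpha^{(1)} Z_\alpha^{(1)\prime}, \ldots, A_{qq} = \sum_{\alpha=1}^{n} Z_\alpha^{(q)} Z_\alpha^{(q)\prime}$
are independent. The rest of Theorem 7.3.5 follows from
Theorem 7.3.4. ■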


7.3.5. Conditional Distributions

In Section 4.3 we considered estimation of the parameters of the conditional
distribution of $X^{(1)}$ given $X^{(2)} = x^{(2)}$. Application of Theorem 7.2.2 to
Theorem 4.3.3 yields the following theorem:

Theorem 7.3.6. Let $A$ and $\Sigma$ be partitioned into $q$ and $p - q$ rows and
columns as in (16). If $A$ is distributed according to $W(\Sigma, n)$, the distribution of
$A_{11 \cdot 2} = A_{11} - A_{12} A_{22}^{-1} A_{21}$ is $W(\Sigma_{11 \cdot 2}, n - p + q)$, $n \ge p - q$.

Note that Theorem 7.3.6 implies that $A_{11 \cdot 2}$ is independent of $A_{22}$ and
$A_{12}$ regardless of $\Sigma$.

7.4. COCHRAN’S THEOREM 

Cochran's theorem [Cochran (1934)] is useful in proving that certain vector 
quadratic forms are distributed as sums of vector squares. It is a statistical 
statement of an algebraic theorem, which we shall give as a lemma. 

Lemma 7.4.1. If the $N \times N$ symmetric matrix $C_i$ has rank $r_i$, $i = 1, \ldots, m$,
and

(1)    $\sum_{i=1}^{m} C_i = I_N$,

then

(2)    $\sum_{i=1}^{m} r_i = N$

is a necessary and sufficient condition for there to exist an $N \times N$ orthogonal
matrix $P$ such that for $i = 1, \ldots, m$

(3)    $P C_i P' = \begin{pmatrix} 0 & 0 & 0 \\ 0 & I & 0 \\ 0 & 0 & 0 \end{pmatrix}$,

where $I$ is of order $r_i$, the upper left-hand 0 is square of order $\sum_{j=1}^{i-1} r_j$ (which is
vacuous for $i = 1$), and the lower right-hand 0 is square of order $\sum_{j=i+1}^{m} r_j$ (which
is vacuous for $i = m$).

Proof. The necessity follows from the fact that (1) implies that the sum of
(3) over $i = 1, \ldots, m$ is $I_N$. Now let us prove the sufficiency; we assume (2).
There exists an orthogonal matrix $P_i$ such that $P_i C_i P_i'$ is diagonal with
diagonal elements the characteristic roots of $C_i$. The number of nonzero
roots is $r_i$, the rank of $C_i$, and the number of 0 roots is $N - r_i$. We write

(4)    $P_i C_i P_i' = \begin{pmatrix} 0 & 0 & 0 \\ 0 & \Delta_i & 0 \\ 0 & 0 & 0 \end{pmatrix}$,

where the partitioning is according to (3), and $\Delta_i$ is diagonal of order $r_i$. This
is possible in view of (2). Then

(5)    $P_i \Bigl( \sum_{j \ne i} C_j \Bigr) P_i' = I - P_i C_i P_i' = \begin{pmatrix} I & 0 & 0 \\ 0 & I - \Delta_i & 0 \\ 0 & 0 & I \end{pmatrix}$.

Since the rank of (5) is not greater than $\sum_{j \ne i} r_j = N - r_i$, which is the sum
of the orders of the upper left-hand and lower right-hand $I$'s in (5), the rank
of $I - \Delta_i$ is 0 and $\Delta_i = I$. (Thus the $r_i$ nonzero roots of $C_i$ are 1, and $C_i$ is
positive semidefinite.) From (4) we obtain

(6)    $C_i = P_i' \begin{pmatrix} 0 & 0 & 0 \\ 0 & I & 0 \\ 0 & 0 & 0 \end{pmatrix} P_i = B_i B_i'$,

where $B_i$ consists of the $r_i$ columns of $P_i'$ corresponding to $I$ in (6). From (1)
we obtain

(7)    $I = \sum_{i=1}^{m} B_i B_i' = (B_1, B_2, \ldots, B_m) \begin{pmatrix} B_1' \\ B_2' \\ \vdots \\ B_m' \end{pmatrix} = P'P$,

where $P = (B_1, B_2, \ldots, B_m)'$. ■

We now state a multivariate analog to Cochran’s theorem. 

Theorem 7.4.1. Suppose $Y_1, \ldots, Y_N$ are independently distributed, each
according to $N(0, \Sigma)$. Suppose the matrix $(c^i_{\alpha\beta}) = C_i$ used in forming

(8)    $Q_i = \sum_{\alpha, \beta = 1}^{N} c^i_{\alpha\beta} Y_\alpha Y_\beta', \qquad i = 1, \ldots, m$,

is of rank $r_i$, and suppose

(9)    $\sum_{i=1}^{m} Q_i = \sum_{\alpha=1}^{N} Y_\alpha Y_\alpha'$.

Then (2) is a necessary and sufficient condition for $Q_1, \ldots, Q_m$ to be
independently distributed with $Q_i$ having the distribution $W(\Sigma, r_i)$.

It follows from (3) that $C_i$ is idempotent. See Section A.2 of the Appendix.
This theorem is useful in generalizing results from the univariate analysis
of variance. (See Chapter 8.) As an example of the use of this theorem, let us
prove that the mean of a sample of size $N$ times its transpose and a multiple
of the sample covariance matrix are independently distributed with a singular
and a nonsingular Wishart distribution, respectively. Let $Y_1, \ldots, Y_N$ be
independently distributed, each according to $N(0, \Sigma)$. We shall use the matrices
$C_1 = (c^1_{\alpha\beta}) = (1/N)$ and $C_2 = (c^2_{\alpha\beta}) = [\delta_{\alpha\beta} - (1/N)]$. Then

(10)    $Q_1 = \sum_{\alpha, \beta = 1}^{N} \frac{1}{N} Y_\alpha Y_\beta' = N \bar{Y} \bar{Y}'$,

(11)    $Q_2 = \sum_{\alpha, \beta = 1}^{N} \Bigl( \delta_{\alpha\beta} - \frac{1}{N} \Bigr) Y_\alpha Y_\beta'$
        $= \sum_{\alpha=1}^{N} Y_\alpha Y_\alpha' - N \bar{Y} \bar{Y}'$
        $= \sum_{\alpha=1}^{N} (Y_\alpha - \bar{Y})(Y_\alpha - \bar{Y})'$,

and (9) is satisfied. The matrix $C_1$ is of rank 1; the matrix $C_2$ is of rank $N - 1$
(since the rank of the sum of two matrices is less than or equal to the sum of
the ranks of the matrices and the rank of the second matrix is less than $N$).
The conditions of the theorem are satisfied; therefore $Q_1$ is distributed as
$ZZ'$, where $Z$ is distributed according to $N(0, \Sigma)$, and $Q_2$ is distributed
independently according to $W(\Sigma, N - 1)$.
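
The algebraic side of this example can be checked directly. The Python sketch below (an illustration, not part of the original text) builds $C_1 = (1/N)$ and $C_2 = [\delta_{\alpha\beta} - (1/N)]$ in exact rational arithmetic for a hypothetical $N = 5$ and confirms that they sum to the identity and are idempotent. For an idempotent matrix the rank equals the trace, so the traces give the ranks 1 and $N - 1$.

```python
from fractions import Fraction


def matmul(X, Y):
    """Product of two square matrices stored as nested lists."""
    size = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def trace(X):
    return sum(X[i][i] for i in range(len(X)))


def cochran_matrices(N):
    """Return C1 = (1/N) and C2 = (delta_ab - 1/N) as exact fractions."""
    inv_N = Fraction(1, N)
    C1 = [[inv_N for _ in range(N)] for _ in range(N)]
    C2 = [[(1 if a == b else 0) - inv_N for b in range(N)] for a in range(N)]
    return C1, C2


N = 5                                   # hypothetical sample size
C1, C2 = cochran_matrices(N)
identity = [[1 if a == b else 0 for b in range(N)] for a in range(N)]
total = [[C1[a][b] + C2[a][b] for b in range(N)] for a in range(N)]

print(total == identity)                # condition (1) of Lemma 7.4.1
print(trace(C1), trace(C2))             # ranks of the idempotent matrices
```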

Anderson and Styan (1982) have given a survey of proofs and extensions of 
Cochran’s theorem. 


7.5. THE GENERALIZED VARIANCE 
7.5.1. Definition of the Generalized Variance

One multivariate analog of the variance $\sigma^2$ of a univariate distribution is the
covariance matrix $\Sigma$. Another multivariate analog is the scalar $|\Sigma|$, which is
called the generalized variance of the multivariate distribution [Wilks (1932);
see also Frisch (1929)]. Similarly, the generalized variance of the sample of
vectors $x_1, \ldots, x_N$ is

(1)    $|S| = \left| \frac{1}{N-1} \sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})' \right|$.

In some sense each of these is a measure of spread. We consider them here
because the sample generalized variance will recur in many likelihood ratio
criteria for testing hypotheses.

A geometric interpretation of the sample generalized variance comes from
considering the $p$ rows of $X = (x_1, \ldots, x_N)$ as $p$ vectors in $N$-dimensional
space. In Section 3.2 it was shown that the rows of

(2)    $(x_1 - \bar{x}, \ldots, x_N - \bar{x}) = X - \bar{x}\varepsilon'$,

where $\varepsilon = (1, \ldots, 1)'$, are orthogonal to the equiangular line (through the
origin and $\varepsilon$); see Figure 3.2. Then the entries of

(3)    $A = (X - \bar{x}\varepsilon')(X - \bar{x}\varepsilon')'$

are the inner products of rows of $X - \bar{x}\varepsilon'$.

We now define a parallelotope determined by $p$ vectors $v_1, \ldots, v_p$ in an
$n$-dimensional space ($n \ge p$). If $p = 1$, the parallelotope is the line segment
$v_1$. If $p = 2$, the parallelotope is the parallelogram with $v_1$ and $v_2$ as principal
edges; that is, its sides are $v_1$, $v_2$, $v_1$ translated so its initial endpoint is at $v_2$,
and $v_2$ translated so its initial endpoint is at $v_1$. See Figure 7.2. If $p = 3$, the
parallelotope is the conventional parallelepiped with $v_1$, $v_2$, and $v_3$ as
principal edges. In general, the parallelotope is the figure defined by the
principal edges $v_1, \ldots, v_p$. It is cut out by $p$ pairs of parallel $(p - 1)$-dimensional
hyperplanes, one hyperplane of a pair being spanned by $p - 1$ of
$v_1, \ldots, v_p$ and the other hyperplane going through the endpoint of the
remaining vector.

Figure 7.2. A parallelogram.

Theorem 7.5.1. If $V = (v_1, \ldots, v_p)$, then the square of the $p$-dimensional
volume of the parallelotope with $v_1, \ldots, v_p$ as principal edges is $|V'V|$.

Proof. If $p = 1$, then $|V'V| = v_1'v_1 = \|v_1\|^2$, which is the square of the
one-dimensional volume of $v_1$. If two $k$-dimensional parallelotopes have
bases consisting of $(k - 1)$-dimensional parallelotopes of equal $(k - 1)$-dimensional
volumes and equal altitudes, their $k$-dimensional volumes are
equal [since the $k$-dimensional volume is the integral of the $(k - 1)$-dimensional
volumes]. In particular, the volume of a $k$-dimensional parallelotope
is equal to the volume of a parallelotope with the same base (in $k - 1$
dimensions) and same altitude with sides in the $k$th direction orthogonal to
the first $k - 1$ directions. Thus the volume of the parallelotope with principal
edges $v_1, \ldots, v_k$, say $P_k$, is equal to the volume of the parallelotope with
principal edges $v_1, \ldots, v_{k-1}$, say $P_{k-1}$, times the altitude of $P_k$ over $P_{k-1}$;
that is,

(4)    $\operatorname{Vol}(P_k) = \operatorname{Vol}(P_{k-1}) \times \operatorname{Alt}(P_k \mid P_{k-1})$.

It follows (by induction) that

(5)    $\operatorname{Vol}(P_p) = \operatorname{Vol}(P_1) \times \operatorname{Alt}(P_2 \mid P_1) \times \cdots \times \operatorname{Alt}(P_p \mid P_{p-1})$.

By the construction in Section 7.2 the altitude of $P_k$ over $P_{k-1}$ is $t_{kk} = \|w_k\|$;
that is, $t_{kk}$ is the distance of $v_k$ from the $(k - 1)$-dimensional space spanned
by $v_1, \ldots, v_{k-1}$ (or $w_1, \ldots, w_{k-1}$). Hence $\operatorname{Vol}(P_p) = \prod_{k=1}^{p} t_{kk}$. Since $|V'V| =
|TT'| = \prod_{k=1}^{p} t_{kk}^2$, the theorem is proved. ■

We now apply this theorem to the parallelotope having the rows of (2)
as principal edges. The dimensionality in Theorem 7.5.1 is arbitrary (but at
least $p$).

Corollary 7.5.1. The square of the $p$-dimensional volume of the parallelotope
with the rows of (2) as principal edges is $|A|$, where $A$ is given by (3).

We shall see later that many multivariate statistics can be given an
interpretation in terms of these volumes. These volumes are analogous to
distances that arise in special cases when $p = 1$.

We now consider a geometric interpretation of $|A|$ in terms of $N$ points in
$p$-space. Let the columns of the matrix (2) be $y_1, \ldots, y_N$, representing $N$
points in $p$-space. When $p = 1$, $|A| = \sum_\alpha y_{1\alpha}^2$, which is the sum of squares of
the distances from the points to the origin. In general $|A|$ is the sum of
squares of the volumes of all parallelotopes formed by taking as principal
edges $p$ vectors from the set $y_1, \ldots, y_N$.

We see that

(6)    $|A| = \begin{vmatrix} \sum_\alpha y_{1\alpha}^2 & \cdots & \sum_\alpha y_{1\alpha} y_{p-1,\alpha} & \sum_\beta y_{1\beta} y_{p\beta} \\ \vdots & & \vdots & \vdots \\ \sum_\alpha y_{p-1,\alpha} y_{1\alpha} & \cdots & \sum_\alpha y_{p-1,\alpha}^2 & \sum_\beta y_{p-1,\beta} y_{p\beta} \\ \sum_\alpha y_{p\alpha} y_{1\alpha} & \cdots & \sum_\alpha y_{p\alpha} y_{p-1,\alpha} & \sum_\beta y_{p\beta}^2 \end{vmatrix}$

        $= \sum_\beta \begin{vmatrix} \sum_\alpha y_{1\alpha}^2 & \cdots & \sum_\alpha y_{1\alpha} y_{p-1,\alpha} & y_{1\beta} y_{p\beta} \\ \vdots & & \vdots & \vdots \\ \sum_\alpha y_{p-1,\alpha} y_{1\alpha} & \cdots & \sum_\alpha y_{p-1,\alpha}^2 & y_{p-1,\beta} y_{p\beta} \\ \sum_\alpha y_{p\alpha} y_{1\alpha} & \cdots & \sum_\alpha y_{p\alpha} y_{p-1,\alpha} & y_{p\beta}^2 \end{vmatrix}$

by the rule for expanding determinants. [See (24) of Section A.1 of the
Appendix.] In (6) the matrix $A$ has been partitioned into $p - 1$ and 1
columns. Applying the rule successively to the columns, we find

(7)    $|A| = \sum_{\alpha_1, \ldots, \alpha_p = 1}^{N} \left| y_{i\alpha_j} y_{j\alpha_j} \right|$.

By Theorem 7.5.1 the square of the volume of the parallelotope with
$y_{\gamma_1}, \ldots, y_{\gamma_p}$, $\gamma_1 < \cdots < \gamma_p$, as principal edges is

(8)    $V^2_{\gamma_1, \ldots, \gamma_p} = \Bigl| \sum_\beta y_{i\beta} y_{j\beta} \Bigr|$,

where the sum on $\beta$ is over $(\gamma_1, \ldots, \gamma_p)$. If we now expand this determinant
in the manner used for $|A|$, we obtain

(9)    $V^2_{\gamma_1, \ldots, \gamma_p} = \sum \left| y_{i\beta_j} y_{j\beta_j} \right|$,

where the sum is for each $\beta_j$ over the range $(\gamma_1, \ldots, \gamma_p)$. Summing (9) over all
different sets $(\gamma_1 < \cdots < \gamma_p)$, we obtain (7). ($|y_{i\beta_j} y_{j\beta_j}| = 0$ if two or more $\beta_j$ are
equal.) Thus $|A|$ is the sum of volumes squared of all different parallelotopes
formed by sets of $p$ of the vectors $y_\alpha$ as principal edges. If we replace $y_\alpha$ by
$x_\alpha - \bar{x}$, we can state the following theorem:

Theorem 7.5.2. Let $|S|$ be defined by (1), where $x_1, \ldots, x_N$ are the $N$
vectors of a sample. Then $|S|$ is proportional to the sum of squares of the
volumes of all the different parallelotopes formed by using as principal edges $p$
vectors with $p$ of $x_1, \ldots, x_N$ as one set of endpoints and $\bar{x}$ as the other, and the
factor of proportionality is $1/(N - 1)^p$.
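
The identity behind this theorem, that $|A|$ equals the sum of the squared volumes over all choices of $p$ of the $N$ centered vectors, can be verified numerically. The following Python sketch is only an illustration: the $3 \times 5$ integer matrix `Y` is a hypothetical set of centered observations (each row sums to zero), and the function names are not from the text. When $p$ vectors lie in $p$-space, the squared volume of their parallelotope is the squared determinant of the matrix they form.

```python
from itertools import combinations


def det(M):
    """Determinant by Laplace expansion; adequate for small matrices."""
    size = len(M)
    if size == 1:
        return M[0][0]
    total = 0
    for j in range(size):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * M[0][j] * det(minor)
    return total


def gram(Y):
    """Return A = Y Y' for a p x N matrix Y (columns are the points y_alpha)."""
    p, N = len(Y), len(Y[0])
    return [[sum(Y[i][a] * Y[j][a] for a in range(N)) for j in range(p)]
            for i in range(p)]


def sum_squared_volumes(Y):
    """Sum of squared volumes over all sets of p columns of Y."""
    p, N = len(Y), len(Y[0])
    total = 0
    for cols in combinations(range(N), p):
        edges = [[Y[i][a] for a in cols] for i in range(p)]
        total += det(edges) ** 2
    return total


# Hypothetical centered data: p = 3 coordinates, N = 5 points.
Y = [[1, -2, 0, 3, -2],
     [2, 1, -1, 0, -2],
     [0, 1, 1, -3, 1]]

det_A = det(gram(Y))
volume_sum = sum_squared_volumes(Y)
print(det_A == volume_sum)
```

Dividing either quantity by $(N - 1)^p$ gives $|S|$ for these data, in line with the factor of proportionality stated in the theorem.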

The population analog of $|S|$ is $|\Sigma|$, which can also be given a geometric
interpretation. From Section 3.3 we know that

(10)    $\Pr\{ X' \Sigma^{-1} X \le \chi_p^2(\alpha) \} = 1 - \alpha$

if $X$ is distributed according to $N(0, \Sigma)$; that is, the probability is $1 - \alpha$ that
$X$ fall inside the ellipsoid

(11)    $x' \Sigma^{-1} x = \chi_p^2(\alpha)$.

The volume of this ellipsoid is $C(p) |\Sigma|^{\frac{1}{2}} [\chi_p^2(\alpha)]^{\frac{1}{2}p} / p$, where $C(p)$ is defined
in Problem 7.3.


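7.5.2. Distribution of the Sample Generalized Variance

The distribution of $|S|$ is the same as the distribution of $|A| / (N - 1)^p$, where
$A = \sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$ and $Z_1, \ldots, Z_n$ are distributed independently, each according
to $N(0, \Sigma)$, and $n = N - 1$. Let $Z_\alpha = C Y_\alpha$, $\alpha = 1, \ldots, n$, where $CC' = \Sigma$. Then
$Y_1, \ldots, Y_n$ are independently distributed, each with distribution $N(0, I)$. Let

(12)    $B = \sum_{\alpha=1}^{n} Y_\alpha Y_\alpha' = \sum_{\alpha=1}^{n} C^{-1} Z_\alpha Z_\alpha' (C^{-1})' = C^{-1} A (C^{-1})'$;

then $|A| = |C| \cdot |B| \cdot |C'| = |B| \cdot |\Sigma|$. By the development in Section 7.2 we
see that $|B|$ has the distribution of $\prod_{i=1}^{p} t_{ii}^2$ and that $t_{11}^2, \ldots, t_{pp}^2$ are independently
distributed with $\chi^2$-distributions.

Theorem 7.5.3. The distribution of the generalized variance $|S|$ of a sample
$X_1, \ldots, X_N$ from $N(\mu, \Sigma)$ is the same as the distribution of $|\Sigma| / (N - 1)^p$ times
the product of $p$ independent factors, the distribution of the $i$th factor being the
$\chi^2$-distribution with $N - i$ degrees of freedom.

If $p = 1$, $|S|$ has the distribution of $|\Sigma| \cdot \chi^2_{N-1} / (N - 1)$. If $p = 2$, $|S|$ has
the distribution of $|\Sigma| \chi^2_{N-1} \cdot \chi^2_{N-2} / (N - 1)^2$. It follows from Problem 7.15 or
7.37 that when $p = 2$, $|S|$ has the distribution of $|\Sigma| (\chi^2_{2N-4})^2 / (2N - 2)^2$. We
can write

(13)    $|A| = |\Sigma| \times \chi^2_{N-1} \times \chi^2_{N-2} \times \cdots \times \chi^2_{N-p}$.

If $p = 2r$, then $|A|$ is distributed as

(14)    $|\Sigma| (\chi^2_{2N-4})^2 \times (\chi^2_{2N-8})^2 \times \cdots \times (\chi^2_{2N-4r})^2 / 2^{2r}$.

Since the $h$th moment of a $\chi^2$-variable with $m$ degrees of freedom is
$2^h \Gamma(\frac{1}{2}m + h) / \Gamma(\frac{1}{2}m)$ and the moment of a product of independent variables
is the product of the moments of the variables, the $h$th moment of $|A|$ is

(15)    $\mathscr{E} |A|^h = |\Sigma|^h \prod_{i=1}^{p} \left\{ 2^h \frac{\Gamma[\frac{1}{2}(N - i) + h]}{\Gamma[\frac{1}{2}(N - i)]} \right\}$
        $= 2^{hp} |\Sigma|^h \frac{\prod_{i=1}^{p} \Gamma[\frac{1}{2}(N - i) + h]}{\prod_{i=1}^{p} \Gamma[\frac{1}{2}(N - i)]}$
        $= 2^{hp} |\Sigma|^h \frac{\Gamma_p[\frac{1}{2}(N - 1) + h]}{\Gamma_p[\frac{1}{2}(N - 1)]}$.

Thus

(16)    $\mathscr{E} |A| = |\Sigma| \prod_{i=1}^{p} (N - i)$,

(17)    $\mathscr{V}(|A|) = |\Sigma|^2 \prod_{i=1}^{p} (N - i) \left[ \prod_{j=1}^{p} (N - j + 2) - \prod_{j=1}^{p} (N - j) \right]$,

where $\mathscr{V}(|A|)$ is the variance of $|A|$.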
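
A small numerical cross-check of (15) through (17) may be helpful. The Python sketch below (illustrative only; the function names and the values $N = 10$, $p = 3$, $|\Sigma| = 2$ are hypothetical) evaluates the gamma-function expression (15) for $h = 1$ and $h = 2$ and compares the results with the closed forms (16) and (17).

```python
import math


def moment_det_A(h, N, p, det_sigma=1.0):
    """h-th moment of |A| from (15), one chi-square factor at a time."""
    value = det_sigma ** h
    for i in range(1, p + 1):
        half_df = 0.5 * (N - i)
        value *= 2.0 ** h * math.gamma(half_df + h) / math.gamma(half_df)
    return value


def mean_det_A(N, p, det_sigma=1.0):
    """Closed form (16)."""
    return det_sigma * math.prod(N - i for i in range(1, p + 1))


def variance_det_A(N, p, det_sigma=1.0):
    """Closed form (17)."""
    prod_df = math.prod(N - i for i in range(1, p + 1))
    prod_df_plus_2 = math.prod(N - j + 2 for j in range(1, p + 1))
    return det_sigma ** 2 * prod_df * (prod_df_plus_2 - prod_df)


N, p, det_sigma = 10, 3, 2.0            # hypothetical values
first = moment_det_A(1, N, p, det_sigma)
second = moment_det_A(2, N, p, det_sigma)
print(first, mean_det_A(N, p, det_sigma))
print(second - first ** 2, variance_det_A(N, p, det_sigma))
```

The sketch requires $N > p$ so that every chi-square factor has positive degrees of freedom.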


7.5.3. The Asymptotic Distribution of the Sample Generalized Variance

Let $|B| / n^p = V_1(n) \times V_2(n) \times \cdots \times V_p(n)$, where the $V$'s are independently
distributed and $n V_i(n) = \chi^2_{n-p+i}$. Since $\chi^2_{n-p+i}$ is distributed as $\sum_{\alpha=1}^{n-p+i} W_\alpha^2$,
where the $W_\alpha$ are independent, each with distribution $N(0, 1)$, the central
limit theorem (applied to $W_\alpha^2$) states that

(18)    $\frac{n V_i(n) - (n - p + i)}{\sqrt{2(n - p + i)}} = \frac{\sqrt{n} \, [V_i(n) - 1 + (p - i)/n]}{\sqrt{2} \, \sqrt{1 - (p - i)/n}}$

is asymptotically distributed according to $N(0, 1)$. Then $\sqrt{n} \, [V_i(n) - 1]$ is
asymptotically distributed according to $N(0, 2)$. We now apply Theorem 4.2.3.
We have

(19)    $U(n) = \begin{pmatrix} V_1(n) \\ \vdots \\ V_p(n) \end{pmatrix}, \qquad b = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}$,

$|B| / n^p = w = f(u_1, \ldots, u_p) = u_1 u_2 \cdots u_p$, $T = 2I$, $\partial f / \partial u_i |_{u=b} = 1$, and $\phi_b' T \phi_b
= 2p$. Thus

(20)    $\sqrt{n} \left( \frac{|B|}{n^p} - 1 \right)$

is asymptotically distributed according to $N(0, 2p)$.

Theorem 7.5.4. Let $S$ be a $p \times p$ sample covariance matrix with $n$ degrees of
freedom. Then $\sqrt{n} \, (|S| / |\Sigma| - 1)$ is asymptotically normally distributed with
mean 0 and variance $2p$.
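
The limiting variance $2p$ can be compared with the exact finite-sample variance. Since $|S| / |\Sigma|$ is distributed as a product of independent $\chi^2_m / n$ factors with $m = n - p + 1, \ldots, n$, and since $\mathscr{E} \chi^2_m = m$ and $\mathscr{E} (\chi^2_m)^2 = m(m + 2)$, the variance of $\sqrt{n} \, (|S| / |\Sigma| - 1)$ is available in closed form. The Python sketch below is an illustration under those standard chi-square moments; the function name is hypothetical.

```python
from fractions import Fraction


def scaled_variance(n, p):
    """Exact variance of sqrt(n) * (|S|/|Sigma| - 1) for n degrees of freedom."""
    first_moment = Fraction(1)
    second_moment = Fraction(1)
    for m in range(n - p + 1, n + 1):   # degrees of freedom of the p factors
        first_moment *= Fraction(m, n)
        second_moment *= Fraction(m * (m + 2), n * n)
    return n * (second_moment - first_moment ** 2)


for n in (20, 200, 2000):
    print(n, float(scaled_variance(n, 3)))   # approaches 2p = 6 as n grows
```

The approach to $2p$ is slow for small $n$, which suggests some caution in using the normal approximation when $n$ is not large relative to $p$.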


7.6. DISTRIBUTION OF THE SET OF CORRELATION COEFFICIENTS 
WHEN THE POPULATION COVARIANCE MATRIX IS DIAGONAL 


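In Section 4.2.1 we found the distribution of a single sample correlation when
the corresponding population correlation was zero. Here we shall find the
density of the set $r_{ij}$, $i < j$, $i, j = 1, \ldots, p$, when $\rho_{ij} = 0$, $i < j$.

We start with the distribution of $A$ when $\Sigma$ is diagonal; that is,
$W[(\sigma_{ii} \delta_{ij}), n]$. The density of $A$ is

(1)    $\frac{|a_{ij}|^{\frac{1}{2}(n-p-1)} \exp\left( -\frac{1}{2} \sum_{i=1}^{p} a_{ii} / \sigma_{ii} \right)}{2^{\frac{1}{2}np} \prod_{i=1}^{p} \sigma_{ii}^{\frac{1}{2}n} \, \Gamma_p(\frac{1}{2}n)}$,

since

(2)    $|\Sigma| = \begin{vmatrix} \sigma_{11} & 0 & \cdots & 0 \\ 0 & \sigma_{22} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \sigma_{pp} \end{vmatrix} = \prod_{i=1}^{p} \sigma_{ii}$.

We make the transformation

(3)    $a_{ij} = \sqrt{a_{ii}} \sqrt{a_{jj}} \, r_{ij}, \qquad i \ne j$,

(4)    $a_{ii} = a_{ii}$.

The Jacobian is the product of the Jacobian of (4) and that of (3) for $a_{ii}$
fixed. The Jacobian of (3) is the determinant of a $p(p - 1)/2$-order diagonal
matrix with diagonal elements $\sqrt{a_{ii}} \sqrt{a_{jj}}$. Since each particular subscript $k$,
say, appears in the set $r_{ij}$ ($i < j$) $p - 1$ times, the Jacobian is

(5)    $J = \prod_{i=1}^{p} a_{ii}^{\frac{1}{2}(p-1)}$.

If we substitute from (3) and (4) into $w[A \mid (\sigma_{ii} \delta_{ij}), n]$ and multiply by (5), we
obtain as the joint density of $\{a_{ii}\}$ and $\{r_{ij}\}$

(6)    $\frac{|r_{ij}|^{\frac{1}{2}(n-p-1)} \prod_{i=1}^{p} a_{ii}^{\frac{1}{2}(n-p-1)} \exp\left( -\frac{1}{2} \sum_{i} a_{ii} / \sigma_{ii} \right)}{2^{\frac{1}{2}np} \prod_{i} \sigma_{ii}^{\frac{1}{2}n} \, \Gamma_p(\frac{1}{2}n)} \prod_{i=1}^{p} a_{ii}^{\frac{1}{2}(p-1)}$
        $= \frac{|r_{ij}|^{\frac{1}{2}(n-p-1)}}{\Gamma_p(\frac{1}{2}n)} \prod_{i=1}^{p} \left\{ \frac{a_{ii}^{\frac{1}{2}n - 1} \exp\left( -\frac{1}{2} a_{ii} / \sigma_{ii} \right)}{2^{\frac{1}{2}n} \sigma_{ii}^{\frac{1}{2}n}} \right\}$,

since

(7)    $|a_{ij}| = \left| \sqrt{a_{ii}} \sqrt{a_{jj}} \, r_{ij} \right| = \prod_{i=1}^{p} a_{ii} \, |r_{ij}|$,

where $r_{ii} = 1$. In the $i$th term of the product on the right-hand side of (6), let
$a_{ii} / (2\sigma_{ii}) = u_i$; then the integral of this term is

(8)    $\int_0^\infty \frac{a_{ii}^{\frac{1}{2}n - 1} \exp\left( -\frac{1}{2} a_{ii} / \sigma_{ii} \right)}{2^{\frac{1}{2}n} \sigma_{ii}^{\frac{1}{2}n}} \, da_{ii} = \int_0^\infty u_i^{\frac{1}{2}n - 1} e^{-u_i} \, du_i = \Gamma(\tfrac{1}{2}n)$

by definition of the gamma function (or by the fact that $a_{ii} / \sigma_{ii}$ has the
$\chi^2$-density with $n$ degrees of freedom). Hence the density of $r_{ij}$ is

(9)    $\frac{\Gamma^p(\frac{1}{2}n) \, |r_{ij}|^{\frac{1}{2}(n-p-1)}}{\Gamma_p(\frac{1}{2}n)}$.

Theorem 7.6.1. If $X_1, \ldots, X_N$ are independent, each with distribution
$N[\mu, (\sigma_{ii} \delta_{ij})]$, then the density of the sample correlation coefficients is given by
(9) where $n = N - 1$.

7.7. THE INVERTED WISHART DISTRIBUTION AND BAYES
ESTIMATION OF THE COVARIANCE MATRIX


7.7.1. The Inverted Wishart Distribution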

As indicated in Section 3.4.2, Bayes estimators are usually admissible. The
calculation of Bayes estimators is facilitated when the prior distribution of
the parameter is chosen conveniently. When there is a sufficient statistic,
there will exist a family of prior distributions for the parameter such that the
posterior distribution is a member of this family; such a family is called a
conjugate family of distributions. In Section 3.4.2 we saw that the normal
family of priors is conjugate to the normal family of distributions when the
covariance matrix is given. In this section we shall consider Bayesian estimation
of the covariance matrix and estimation of the mean vector and the
covariance matrix.


Theorem 7.7.1. If $A$ has the distribution $W(\Sigma, m)$, then $B = A^{-1}$ has the
density

(1)    $\frac{|\Psi|^{\frac{1}{2}m} \, |B|^{-\frac{1}{2}(m+p+1)} \, e^{-\frac{1}{2} \operatorname{tr} \Psi B^{-1}}}{2^{\frac{1}{2}mp} \, \Gamma_p(\frac{1}{2}m)}$

for $B$ positive definite and 0 elsewhere, where $\Psi = \Sigma^{-1}$.

Proof. By Theorem A.4.6 of the Appendix, the Jacobian of the transformation
$A = B^{-1}$ is $|B|^{-(p+1)}$. Substitution of $B^{-1}$ for $A$ in (16) of Section 7.2
and multiplication by $|B|^{-(p+1)}$ yields (1). ■

We shall call (1) the density of the inverted Wishart distribution with $m$
degrees of freedom† and denote the distribution by $W^{-1}(\Psi, m)$ and the
density by $w^{-1}(B \mid \Psi, m)$. We shall call $\Psi$ the precision matrix or concentration
matrix.


7.7.2. Bayes Estimation of the Covariance Matrix 

The covariance matrix of a sample of size $N$ from $N(\mu, \Sigma)$ has the distribution
of $(1/n)A$, where $A$ has the distribution $W(\Sigma, n)$ and $n = N - 1$. We
shall now show that if $\Sigma$ is assigned an inverted Wishart distribution, then
the conditional distribution of $\Sigma$ given $A$ is an inverted Wishart distribution.
In other words, the family of inverted Wishart distributions for $\Sigma$ is conjugate
to the family of Wishart distributions.

†The definition of the number of degrees of freedom differs from that of Giri (1977), p. 104, and
Muirhead (1982), p. 113.

Theorem 7.7.2. If $A$ has the distribution $W(\Sigma, n)$ and $\Sigma$ has the a priori
distribution $W^{-1}(\Psi, m)$, then the conditional distribution of $\Sigma$ given $A$ is
$W^{-1}(A + \Psi, n + m)$.

Proof. The joint density of $A$ and $\Sigma$ is

(2)    $\frac{|\Psi|^{\frac{1}{2}m} \, |\Sigma|^{-\frac{1}{2}(n+m+p+1)} \, |A|^{\frac{1}{2}(n-p-1)} \, e^{-\frac{1}{2} \operatorname{tr} (A + \Psi) \Sigma^{-1}}}{2^{\frac{1}{2}(n+m)p} \, \Gamma_p(\frac{1}{2}n) \, \Gamma_p(\frac{1}{2}m)}$

for $A$ and $\Sigma$ positive definite. The marginal density of $A$ is the integral of (2)
over the set of $\Sigma$ positive definite. Since the integral of (1) with respect to $B$
is 1 identically in $\Psi$, the integral of (2) with respect to $\Sigma$ is

(3)    $\frac{\Gamma_p[\frac{1}{2}(n + m)] \, |\Psi|^{\frac{1}{2}m} \, |A|^{\frac{1}{2}(n-p-1)} \, |A + \Psi|^{-\frac{1}{2}(n+m)}}{\Gamma_p(\frac{1}{2}n) \, \Gamma_p(\frac{1}{2}m)}$

for $A$ positive definite. The conditional density of $\Sigma$ given $A$ is the ratio of
(2) to (3), namely,

(4)    $\frac{|A + \Psi|^{\frac{1}{2}(n+m)} \, |\Sigma|^{-\frac{1}{2}(n+m+p+1)} \, e^{-\frac{1}{2} \operatorname{tr} (A + \Psi) \Sigma^{-1}}}{2^{\frac{1}{2}(n+m)p} \, \Gamma_p[\frac{1}{2}(n + m)]}$,

which is $w^{-1}(\Sigma \mid A + \Psi, n + m)$. ■


Corollary 7.7.1. If $nS$ has the distribution $W(\Sigma, n)$ and $\Sigma$ has the a priori
distribution $W^{-1}(\Psi, m)$, then the conditional distribution of $\Sigma$ given $S$ is
$W^{-1}(nS + \Psi, n + m)$.

Corollary 7.7.2. If $nS$ has the distribution $W(\Sigma, n)$, $\Sigma$ has the a priori
distribution $W^{-1}(\Psi, m)$, and the loss function is $\operatorname{tr} (D - \Sigma) G (D - \Sigma) H$, where
$G$ and $H$ are positive definite, then the Bayes estimator for $\Sigma$ is

(5)    $\frac{1}{n + m - p - 1} (nS + \Psi)$.

Proof. It follows from Section 3.4.2 that the Bayes estimator for $\Sigma$ is
$\mathscr{E}(\Sigma \mid S)$. From Theorem 7.7.2 we see that $\Sigma^{-1}$ has the a posteriori distribution
$W[(nS + \Psi)^{-1}, n + m]$. The theorem results from the following lemma. ■

Lemma 7.7.1. If $A$ has the distribution $W(\Sigma, n)$, then

(6)    $\mathscr{E} A^{-1} = \frac{1}{n - p - 1} \Sigma^{-1}$.

Proof. If $C$ is a nonsingular matrix such that $\Sigma = CC'$, then $A$ has the
distribution of $CBC'$, where $B$ has the distribution $W(I, n)$, and $\mathscr{E} A^{-1} =
(C')^{-1} (\mathscr{E} B^{-1}) C^{-1}$. By symmetry the diagonal elements of $\mathscr{E} B^{-1}$ are the
same and the off-diagonal elements are the same; that is, $\mathscr{E} B^{-1} = k_1 I +
k_2 \varepsilon \varepsilon'$. For every orthogonal matrix $Q$, $QBQ'$ has the distribution $W(I, n)$ and
hence $\mathscr{E} (QBQ')^{-1} = Q \, \mathscr{E} B^{-1} Q' = \mathscr{E} B^{-1}$. Thus $k_2 = 0$. A diagonal element of
$B^{-1}$ has the distribution of $(\chi^2_{n-p+1})^{-1}$. (See, e.g., the proof of Theorem
5.2.2.) Since $\mathscr{E} (\chi^2_{n-p+1})^{-1} = (n - p - 1)^{-1}$, $\mathscr{E} B^{-1} = (n - p - 1)^{-1} I$. Then (6)
follows. ■

We note that $(n - p - 1) A^{-1} = [(n - p - 1)/n] S^{-1}$ is an unbiased
estimator of the precision $\Sigma^{-1}$.

If $\mu$ is known, the unbiased estimator of $\Sigma$ is $(1/N) \sum_{\alpha=1}^{N} (x_\alpha - \mu)(x_\alpha -
\mu)'$. The above can be applied with $n$ replaced by $N$. Note that if $n$ (or $N$) is
large, (5) is approximately $S$.
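
The estimator (5) is simple to compute. The Python sketch below is a minimal illustration with hypothetical inputs: `S` is a sample covariance matrix, `Psi` is the prior matrix $\Psi$ of the inverted Wishart distribution, and `n`, `m` are the two numbers of degrees of freedom. It also illustrates the remark that (5) is approximately $S$ when $n$ is large.

```python
def bayes_covariance_estimate(S, Psi, n, m):
    """Bayes estimator (5): (n S + Psi) / (n + m - p - 1)."""
    p = len(S)
    denominator = n + m - p - 1
    if denominator <= 0:
        raise ValueError("n + m - p - 1 must be positive")
    return [[(n * S[i][j] + Psi[i][j]) / denominator for j in range(p)]
            for i in range(p)]


# Hypothetical 2 x 2 example.
S = [[2.0, 0.5], [0.5, 1.0]]
Psi = [[4.0, 0.0], [0.0, 4.0]]

small_sample = bayes_covariance_estimate(S, Psi, n=10, m=5)
large_sample = bayes_covariance_estimate(S, Psi, n=10 ** 6, m=5)
print(small_sample)
print(large_sample)     # close to S, since the prior has little weight
```

The guard on the denominator reflects the fact that the posterior mean of $\Sigma$ exists only when $n + m - p - 1 > 0$.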

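Theorem 7.7.3. Let $x_1, \ldots, x_N$ be observations from $N(\mu, \Sigma)$. Suppose $\mu$
and $\Sigma$ have the a priori density $n(\mu \mid \nu, (1/K)\Sigma) \times w^{-1}(\Sigma \mid \Psi, m)$. Then the a
posteriori density of $\mu$ and $\Sigma$ given $\bar{x} = (1/N) \sum_{\alpha=1}^{N} x_\alpha$ and $S = (1/n) \sum_{\alpha=1}^{N} (x_\alpha
- \bar{x})(x_\alpha - \bar{x})'$ is

(7)    $n\left( \mu \,\Big|\, \frac{1}{N + K}(N\bar{x} + K\nu), \frac{1}{N + K} \Sigma \right)$
        $\times \, w^{-1}\left( \Sigma \,\Big|\, \Psi + nS + \frac{NK}{N + K} (\bar{x} - \nu)(\bar{x} - \nu)', N + m \right)$.

Proof. Since $\bar{x}$ and $nS = A$ are a sufficient set of statistics, we can consider
the joint density of $\bar{x}$, $A$, $\mu$, and $\Sigma$, which is

(8)    $\frac{N^{\frac{1}{2}p} K^{\frac{1}{2}p} \, |\Psi|^{\frac{1}{2}m} \, |\Sigma|^{-\frac{1}{2}(N+m+p+2)} \, |A|^{\frac{1}{2}(N-p-2)}}{2^{\frac{1}{2}(N+m+1)p} \, \pi^{p} \, \Gamma_p[\frac{1}{2}(N - 1)] \, \Gamma_p(\frac{1}{2}m)}$
        $\cdot \exp\left\{ -\tfrac{1}{2} \left[ N(\bar{x} - \mu)' \Sigma^{-1} (\bar{x} - \mu) + \operatorname{tr} A \Sigma^{-1} + K(\mu - \nu)' \Sigma^{-1} (\mu - \nu) + \operatorname{tr} \Psi \Sigma^{-1} \right] \right\}$.

The marginal density of $\bar{x}$ and $A$ is the integral of (8) with respect to $\mu$ and
$\Sigma$. The exponential in (8) is $-\frac{1}{2}$ times

(9)    $(N + K) \mu' \Sigma^{-1} \mu - 2(N\bar{x} + K\nu)' \Sigma^{-1} \mu + N \bar{x}' \Sigma^{-1} \bar{x} + K \nu' \Sigma^{-1} \nu + \operatorname{tr} (A + \Psi) \Sigma^{-1}$
        $= (N + K) \left[ \mu - \frac{1}{N + K}(N\bar{x} + K\nu) \right]' \Sigma^{-1} \left[ \mu - \frac{1}{N + K}(N\bar{x} + K\nu) \right]$
        $+ \frac{NK}{N + K} (\bar{x} - \nu)' \Sigma^{-1} (\bar{x} - \nu) + \operatorname{tr} (A + \Psi) \Sigma^{-1}$.

The integral of (8) with respect to $\mu$ is

(10)    $\frac{N^{\frac{1}{2}p} K^{\frac{1}{2}p} \, |\Psi|^{\frac{1}{2}m} \, |\Sigma|^{-\frac{1}{2}(N+m+p+1)} \, |A|^{\frac{1}{2}(N-p-2)}}{(N + K)^{\frac{1}{2}p} \, 2^{\frac{1}{2}(N+m)p} \, \pi^{\frac{1}{2}p} \, \Gamma_p[\frac{1}{2}(N - 1)] \, \Gamma_p(\frac{1}{2}m)}$
        $\cdot \exp\left\{ -\tfrac{1}{2} \left[ \operatorname{tr} A \Sigma^{-1} + \frac{NK}{N + K} (\bar{x} - \nu)' \Sigma^{-1} (\bar{x} - \nu) + \operatorname{tr} \Psi \Sigma^{-1} \right] \right\}$.

In turn, the integral of (10) with respect to $\Sigma$ is

(11)    $\frac{N^{\frac{1}{2}p} K^{\frac{1}{2}p} \, \Gamma_p[\frac{1}{2}(N + m)] \, |\Psi|^{\frac{1}{2}m} \, |A|^{\frac{1}{2}(N-p-2)}}{\pi^{\frac{1}{2}p} \, \Gamma_p[\frac{1}{2}(N - 1)] \, \Gamma_p(\frac{1}{2}m) \, (N + K)^{\frac{1}{2}p}} \left| \Psi + A + \frac{NK}{N + K} (\bar{x} - \nu)(\bar{x} - \nu)' \right|^{-\frac{1}{2}(N+m)}$.

The conditional density of $\mu$ and $\Sigma$ given $\bar{x}$ and $A$ is the ratio of (8) to (11),
namely,

(12)    $\frac{(N + K)^{\frac{1}{2}p} \, |\Sigma|^{-\frac{1}{2}(N+m+p+2)} \, \left| \Psi + A + \frac{NK}{N + K} (\bar{x} - \nu)(\bar{x} - \nu)' \right|^{\frac{1}{2}(N+m)}}{2^{\frac{1}{2}(N+m+1)p} \, \pi^{\frac{1}{2}p} \, \Gamma_p[\frac{1}{2}(N + m)]}$
        $\cdot \exp\left\{ -\tfrac{1}{2} \left[ (N + K) \left[ \mu - \frac{1}{N + K}(N\bar{x} + K\nu) \right]' \Sigma^{-1} \left[ \mu - \frac{1}{N + K}(N\bar{x} + K\nu) \right] \right. \right.$
        $\left. \left. + \operatorname{tr} \left[ \Psi + A + \frac{NK}{N + K} (\bar{x} - \nu)(\bar{x} - \nu)' \right] \Sigma^{-1} \right] \right\}$.

Then (12) can be written as (7). ■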


Corollary 7.7.3. If $x_1, \ldots, x_N$ are observations from $N(\mu, \Sigma)$, if $\mu$ and $\Sigma$
have the a priori density $n[\mu \mid \nu, (1/K)\Sigma] \times w^{-1}(\Sigma \mid \Psi, m)$, and if the loss
function is $(d - \mu)' J (d - \mu) + \operatorname{tr} (D - \Sigma) G (D - \Sigma) H$, then the Bayes estimators
of $\mu$ and $\Sigma$ are

(13)    $\frac{1}{N + K} (N\bar{x} + K\nu)$

and

(14)    $\frac{1}{N + m - p - 1} \left[ nS + \Psi + \frac{NK}{N + K} (\bar{x} - \nu)(\bar{x} - \nu)' \right]$,

respectively.

The estimator of $\mu$ is a weighted average of the sample mean $\bar{x}$ and the
a priori mean $\nu$. If $N$ is large, the a priori mean has relatively little weight.
The estimator of $\Sigma$ is a weighted average of the sample covariance $S$, $\Psi$,
and a term deriving from the difference between the sample mean and the a
priori mean. If $N$ is large, the estimator is close to the sample covariance
matrix.

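Theorem 7.7.4. If $x_1, \ldots, x_N$ are observations from $N(\mu, \Sigma)$ and if $\mu$ and
$\Sigma$ have the a priori density $n[\mu \mid \nu, (1/K)\Sigma] \times w^{-1}(\Sigma \mid \Psi, m)$, then the marginal
a posteriori density of $\mu$ given $\bar{x}$ and $S$ is

(15)    $\frac{(N + K)^{\frac{1}{2}p} \, \Gamma[\frac{1}{2}(N + m + 1)] \, |B|^{-\frac{1}{2}}}{\pi^{\frac{1}{2}p} \, \Gamma[\frac{1}{2}(N + m + 1 - p)] \, [1 + (N + K)(\mu - \mu^*)' B^{-1} (\mu - \mu^*)]^{\frac{1}{2}(N+m+1)}}$,

where $\mu^*$ is (13) and $B$ is $N + m - p - 1$ times (14).

Proof. The exponent in (12) is $-\frac{1}{2}$ times

(16)    $\operatorname{tr} [B + (N + K)(\mu - \mu^*)(\mu - \mu^*)'] \Sigma^{-1}$.

Then the integral of (12) with respect to $\Sigma$ is

(17)    $\frac{(N + K)^{\frac{1}{2}p} \, \Gamma_p[\frac{1}{2}(N + m + 1)] \, |B|^{\frac{1}{2}(N+m)}}{\pi^{\frac{1}{2}p} \, \Gamma_p[\frac{1}{2}(N + m)] \, |B + (N + K)(\mu - \mu^*)(\mu - \mu^*)'|^{\frac{1}{2}(N+m+1)}}$.

Since $|B + xx'| = |B| (1 + x' B^{-1} x)$ (Corollary A.3.1), (15) follows. ■

The density (15) is the multivariate $t$-distribution with $N + m + 1 - p$
degrees of freedom. See Section 2.7.5, Examples.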

7.8. IMPROVED ESTIMATION OF THE COVARIANCE MATRIX

Just as the sample mean $\bar{x}$ can be improved on as an estimator of the
population mean $\mu$ when the loss function is quadratic, so can the sample
covariance $S$ be improved on as an estimator of the population covariance $\Sigma$
for certain loss functions. The loss function for estimation of the location
parameter $\mu$ was invariant with respect to translation: $x \to x + a$, $\mu \to \mu + a$;
and the risk of the sample mean (which is the unique unbiased function of
the sufficient statistic when $\Sigma$ is known) does not depend on the parameter
value. The natural group of transformations of covariance matrices is
multiplication on the left by a nonsingular matrix and on the right by its transpose
($x \to Cx$, $S \to CSC'$, $\Sigma \to C \Sigma C'$). We consider two loss functions which are
invariant with respect to such transformations.

One loss function is quadratic:

(1)    $L_q(\Sigma, G) = \operatorname{tr} (G - \Sigma) \Sigma^{-1} (G - \Sigma) \Sigma^{-1}$
        $= \operatorname{tr} (G \Sigma^{-1} - I)^2$,

where $G$ is a positive definite matrix. The other is based on the form of the
likelihood function:

(2)    $L_l(\Sigma, G) = \operatorname{tr} G \Sigma^{-1} - \log |G \Sigma^{-1}| - p$.

(See Lemma 3.2.2 and alternative proofs in Problems 3.4, 3.8, and 3.12.) Each
of these is 0 when $G = \Sigma$ and is positive when $G \ne \Sigma$. The second loss
function approaches $\infty$ as $G$ approaches a singular matrix or when one or
more elements (or one or more characteristic roots) of $G$ approaches $\infty$. (See
proof of Lemma 3.2.2.) Each is invariant with respect to transformations
$G^* = CGC'$, $\Sigma^* = C \Sigma C'$. We can see some properties of the loss functions
from $L_q(I, D) = \sum_{i=1}^{p} (d_{ii} - 1)^2$ and $L_l(I, D) = \sum_{i=1}^{p} (d_{ii} - \log d_{ii} - 1)$, where
$D$ is diagonal. (By Theorem A.2.2 of the Appendix for arbitrary positive
definite $\Sigma$ and symmetric $G$, there exists a nonsingular $C$ such that $C \Sigma C' = I$
and $CGC' = D$.) If we let $g = (g_{11}, \ldots, g_{pp}, g_{12}, \ldots, g_{p-1,p})'$, $s =
(s_{11}, \ldots, s_{pp}, s_{12}, \ldots, s_{p-1,p})'$, $\sigma = (\sigma_{11}, \ldots, \sigma_{pp}, \sigma_{12}, \ldots, \sigma_{p-1,p})'$, and $\Phi =
\mathscr{E} (s - \sigma)(s - \sigma)'$, then $L_q(\Sigma, G)$ is a constant multiple of $(g - \sigma)' \Phi^{-1} (g - \sigma)$.
(See Problem 7.33.)

The maximum likelihood estimator $\hat{\Sigma}$ and the unbiased estimator $S$ are of
the form $aA$, where $A$ has the distribution $W(\Sigma, n)$ and $n = N - 1$.

Theorem 7.8.1. The quadratic risk of $aA$ is minimized at $a = 1/(n + p + 1)$,
and its value is $p(p + 1)/(n + p + 1)$. The likelihood risk of $aA$ is minimized at
$a = 1/n$ (i.e., $aA = S$), and its value is $p \log n - \sum_{i=1}^{p} \mathscr{E} \log \chi^2_{n+1-i}$.

Proof. By the invariance of the loss function

(3)    $\mathscr{E}_\Sigma L_q(\Sigma, aA) = \mathscr{E}_I L_q(I, aA^*)$
        $= \mathscr{E}_I \operatorname{tr} (aA^* - I)^2$
        $= \mathscr{E}_I \left( a^2 \sum_{i,j=1}^{p} a_{ij}^{*2} - 2a \sum_{i=1}^{p} a_{ii}^* + p \right)$
        $= a^2 [(2n + n^2)p + np(p - 1)] - 2anp + p$
        $= p [n(n + p + 1) a^2 - 2na + 1]$,

which has its minimum at $a = 1/(n + p + 1)$. Similarly

(4)    $\mathscr{E}_\Sigma L_l(\Sigma, aA) = \mathscr{E}_I L_l(I, aA^*)$
        $= \mathscr{E}_I \{ a \operatorname{tr} A^* - \log |A^*| - p \log a - p \}$
        $= p [na - \log a - 1] - \mathscr{E}_I \log |A^*|$,

which is minimized at $a = 1/n$. ■
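
The last line of (3) is a quadratic in $a$, so the statement about the quadratic risk can be confirmed with exact arithmetic. The Python sketch below is an illustration; the function names and the values $n = 12$, $p = 3$ are hypothetical. It also evaluates the risk of the unbiased estimator $S$ (that is, $a = 1/n$), which by the same formula is $p(p + 1)/n$.

```python
from fractions import Fraction


def quadratic_risk(a, n, p):
    """Quadratic risk of a*A from (3): p[n(n+p+1)a^2 - 2na + 1]."""
    return p * (n * (n + p + 1) * a * a - 2 * n * a + 1)


def best_multiplier(n, p):
    """Minimizing value of a for the quadratic loss."""
    return Fraction(1, n + p + 1)


n, p = 12, 3                             # hypothetical values
a_star = best_multiplier(n, p)
risk_best = quadratic_risk(a_star, n, p)
risk_unbiased = quadratic_risk(Fraction(1, n), n, p)
print(risk_best, risk_unbiased)
```

The comparison shows how much is gained under quadratic loss by shrinking $S$ by the factor $n/(n + p + 1)$.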

Although the minimum risk of the estimator of the form $aA$ is constant for
its loss function, the estimator is not minimax. We shall now consider
estimators $G(A)$ such that

(5)    $G(HAH') = H G(A) H'$

for lower triangular matrices $H$. The two loss functions are invariant with
respect to transformations $G^* = HGH'$, $\Sigma^* = H \Sigma H'$.

Let $A = I$ and $H$ be the diagonal matrix $D_i$ with $-1$ as the $i$th diagonal
element and 1 as each other diagonal element. Then $HAH' = I$, and the
$i,j$th component of (5) is

(6)    $g_{ij}(I) = -g_{ij}(I), \qquad j \ne i$.

Hence, $g_{ij}(I) = 0$, $i \ne j$, and $G(I)$ is diagonal, say $D$. Since $A = TT'$ for $T$
lower triangular, we have

(7)    $G(A) = G(TIT')$
        $= T G(I) T'$
        $= TDT'$,

where $D$ is a diagonal matrix not depending on $A$. We note in passing that if
(5) holds for all nonsingular $H$, then $D = aI$ for some $a$. ($H$ can be taken as
a permutation matrix.)

If $\Sigma = KK'$, where $K$ is lower triangular, then

(8)    $\mathscr{E}_\Sigma L[\Sigma, G(A)] = \int L[\Sigma, G(A)] \, C(p, n) \, |\Sigma|^{-\frac{1}{2}n} |A|^{\frac{1}{2}(n-p-1)} e^{-\frac{1}{2} \operatorname{tr} \Sigma^{-1} A} \, dA$
        $= \int L[KK', G(KA^*K')] \, C(p, n) \, |A^*|^{\frac{1}{2}(n-p-1)} e^{-\frac{1}{2} \operatorname{tr} A^*} \, dA^*$
        $= \mathscr{E}_I L[KK', K G(A^*) K']$
        $= \mathscr{E}_I L[I, G(A^*)]$

by invariance of the loss function. The risk does not depend on $\Sigma$.

For the quadratic loss function we calculate

(9)    $\mathscr{E}_I L_q[I, G(A)] = \mathscr{E}_I L_q[I, TDT']$
        $= \mathscr{E}_I \operatorname{tr} (TDT' - I)^2$
        $= \mathscr{E}_I \operatorname{tr} (TDT'TDT' - 2TDT' + I)$
        $= \mathscr{E}_I \sum_{i,j,k,l=1}^{p} t_{ij} d_j t_{kj} t_{kl} d_l t_{il} - 2 \mathscr{E}_I \sum_{i,j=1}^{p} t_{ij}^2 d_j + p$.

The expectations can be evaluated by using the fact that the (nonzero)
elements of $T$ are independent, $t_{ii}^2$ has the $\chi^2$-distribution with $n + 1 - i$
degrees of freedom, and $t_{ij}$, $i > j$, has the distribution $N(0, 1)$. Then

(10)    $\mathscr{E}_I L_q[I, G(A)] = d'Fd - 2f'd + p$,

where $F = (f_{ij})$, $f = (f_i)$,

(11)    $f_{ii} = (n + p - 2i + 1)(n + p - 2i + 3)$,
        $f_{ij} = n + p - 2j + 1, \qquad i < j$,
        $f_i = n + p - 2i + 1$,

and $d = (d_1, \ldots, d_p)'$. Since $d'Fd = \mathscr{E} \operatorname{tr} (TDT')^2 > 0$, $F$ is positive definite
and (10) has a unique minimum. It is attained at $d = F^{-1} f$, and the minimum
is $p - f' F^{-1} f$.

Theorem 7.8.2. With respect to the quadratic loss function the best estimator
invariant with respect to linear transformations $\Sigma \to H \Sigma H'$, $A \to HAH'$, where
$H$ is lower triangular, is $G(A) = TDT'$, where $D$ is the diagonal matrix whose
diagonal elements compose $d = F^{-1} f$, $F$ and $f$ are defined by (11), and $A = TT'$
with $T$ lower triangular.

Since $d = F^{-1} f$ is not proportional to $\varepsilon = (1, \ldots, 1)'$, that is, $F\varepsilon$ is not
proportional to $f$ (see Problem 7.28), this estimator has a smaller (quadratic)
loss than any estimator of the form $aA$ (which is the only type of estimator
invariant under the full linear group). Kiefer (1957) showed that if an
estimator is minimax in the class of estimators invariant with respect to a
group of transformations satisfying certain conditions,† then it is minimax
with respect to all estimators. In this problem the group of triangular linear
transformations satisfies the conditions, while the group of all linear
transformations does not.

The definition of this estimator depends on the coordinate system and on
the numbering of the coordinates. These properties are intuitively unappealing.


Theorem 7.8.3. The estimator G(A) defined in Theorem 7.8.2 is minimax 
with respect to the quadratic loss function. 


In the case of $p = 2$

(12)    $d_1 = \frac{(n + 1)^2 - (n - 1)}{(n + 1)^2 (n + 3) - (n - 1)}, \qquad d_2 = \frac{(n + 1)(n + 2)}{(n + 1)^2 (n + 3) - (n - 1)}$.

The risk is

(13)    $\frac{2(3n^2 + 5n + 4)}{n^3 + 5n^2 + 6n + 4}$.

The difference between the risks of the best estimator $aA$ and the best
estimator $TDT'$ is

(14)    $\frac{6}{n + 3} - \frac{6n^2 + 10n + 8}{n^3 + 5n^2 + 6n + 4} = \frac{2n(n - 1)}{(n + 3)(n^3 + 5n^2 + 6n + 4)}$.

The difference is $\frac{1}{55}$ for $n = 2$ (relative to $\frac{6}{5}$), and $\frac{1}{47}$ for $n = 3$ (relative to 1);
it is of the order $2/n^2$; the improvement due to using the estimator $TDT'$ is
not great, at least for $p = 2$.
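
The quantities $d = F^{-1} f$ and $p - f' F^{-1} f$ can be computed for any $p$ from the definitions in (11). The Python sketch below is an illustration in exact rational arithmetic; the function names are hypothetical, and the off-diagonal entries of $F$ are filled in by symmetry ($f_{ij} = n + p - 2\max(i, j) + 1$). For $p = 2$ it can be compared with the risk differences quoted above.

```python
from fractions import Fraction


def build_F_and_f(n, p):
    """F and f as defined in (11), with F completed by symmetry."""
    f = [Fraction(n + p - 2 * i + 1) for i in range(1, p + 1)]
    F = [[Fraction(0)] * p for _ in range(p)]
    for i in range(1, p + 1):
        for j in range(1, p + 1):
            if i == j:
                F[i - 1][j - 1] = Fraction((n + p - 2 * i + 1) * (n + p - 2 * i + 3))
            else:
                F[i - 1][j - 1] = Fraction(n + p - 2 * max(i, j) + 1)
    return F, f


def solve(F, f):
    """Solve F d = f by Gauss-Jordan elimination with exact fractions."""
    p = len(f)
    M = [row[:] + [rhs] for row, rhs in zip(F, f)]
    for col in range(p):
        pivot = next(r for r in range(col, p) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for r in range(p):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[r][p] for r in range(p)]


def minimum_quadratic_risk(n, p):
    """Return (p - f'F^{-1}f, d) for the best estimator TDT'."""
    F, f = build_F_and_f(n, p)
    d = solve(F, f)
    return p - sum(fi * di for fi, di in zip(f, d)), d


for n in (2, 3):
    risk, d = minimum_quadratic_risk(n, 2)
    best_scalar_risk = Fraction(6, n + 3)        # p(p+1)/(n+p+1) with p = 2
    print(n, d, risk, best_scalar_risk - risk)
```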

For the likelihood loss function we calculate

(15)    $\mathscr{E}_I L_l[I, G(A)] = \mathscr{E}_I L_l[I, TDT']$
        $= \mathscr{E}_I [\operatorname{tr} TDT' - \log |TDT'| - p]$
        $= \mathscr{E}_I \left[ \sum_{i,j=1}^{p} t_{ij}^2 d_j - \sum_{j=1}^{p} \log t_{jj}^2 - \sum_{j=1}^{p} \log d_j - p \right]$
        $= \sum_{j=1}^{p} (n + p - 2j + 1) d_j - \sum_{j=1}^{p} \log d_j - \sum_{j=1}^{p} \mathscr{E} \log \chi^2_{n+1-j} - p$.

The minimum of (15) occurs at $d_j = 1/(n + p - 2j + 1)$, $j = 1, \ldots, p$.

†The essential condition is that the group is solvable. See Kiefer (1966) and Kudo (1955).

Theorem 7.8.4. With respect to the likelihood loss function, the best estimator
invariant with respect to linear transformations $\Sigma \to H \Sigma H'$, $A \to HAH'$,
where $H$ is lower triangular, is $G(A) = TDT'$, where the $j$th diagonal element of
the diagonal matrix $D$ is $1/(n + p - 2j + 1)$, $j = 1, \ldots, p$, and $A = TT'$, with $T$
lower triangular. The minimum risk is

(16)    $\mathscr{E}_\Sigma L_l[\Sigma, G(A)] = \sum_{j=1}^{p} \log (n + p - 2j + 1) - \sum_{j=1}^{p} \mathscr{E} \log \chi^2_{n+1-j}$.

Theorem 7.8.5. The estimator $G(A)$ defined in Theorem 7.8.4 is minimax
with respect to the likelihood loss function.

James and Stein (1961) gave this estimator. Note that the reciprocals of
the weights $1/(n + p - 1)$, $1/(n + p - 3)$, $\ldots$, $1/(n - p + 1)$ are symmetrically
distributed about the reciprocal of $1/n$.
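
To make the construction concrete, the following Python sketch (an illustration, not the authors' code) forms the estimator $TDT'$ of Theorem 7.8.4 from a given positive definite $A$: it computes the lower triangular factor $T$ with $A = TT'$ and applies the weights $d_j = 1/(n + p - 2j + 1)$. The matrix `A`, the value of `n`, and the function names are hypothetical.

```python
import math


def cholesky_lower(A):
    """Lower triangular T with positive diagonal such that A = T T'."""
    p = len(A)
    T = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = A[i][j] - sum(T[i][k] * T[j][k] for k in range(j))
            T[i][j] = math.sqrt(s) if i == j else s / T[j][j]
    return T


def triangular_invariant_estimate(A, n):
    """G(A) = T D T' with d_j = 1/(n + p - 2j + 1), j = 1, ..., p."""
    p = len(A)
    T = cholesky_lower(A)
    d = [1.0 / (n + p - 2 * j + 1) for j in range(1, p + 1)]
    return [[sum(T[i][k] * d[k] * T[j][k] for k in range(p)) for j in range(p)]
            for i in range(p)]


A = [[4.0, 2.0], [2.0, 3.0]]             # hypothetical A with n = 5
G = triangular_invariant_estimate(A, n=5)
S = [[a / 5 for a in row] for row in A]  # the usual estimator S = A/n
print(G)
print(S)
```

Because $T$ depends on the order of the coordinates, permuting the rows and columns of `A` before calling the function generally gives a different estimate; this is the coordinate dependence noted in the text.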

If P-2, 



The difference between the risks of the best estimator $aA$ and the best estimator $TDT'$ is

(19) $$p\log n - \sum_{j=1}^p \log(n + p - 2j + 1) = -\sum_{j=1}^p \log\Bigl(1 + \frac{p - 2j + 1}{n}\Bigr).$$

If $p = 2$, the improvement is

(20) $$-\log\Bigl(1 + \frac{1}{n}\Bigr) - \log\Bigl(1 - \frac{1}{n}\Bigr) = -\log\Bigl(1 - \frac{1}{n^2}\Bigr) = \frac{1}{n^2} + \frac{1}{2n^4} + \frac{1}{3n^6} + \cdots,$$

which is 0.288 for $n = 2$, 0.118 for $n = 3$, 0.065 for $n = 4$, etc. The risk (19) is $O(1/n^2)$ for any $p$. (See Problem 7.31.)
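
The numerical values quoted for $p = 2$ follow directly from (19). A short sketch (the function name is illustrative only):

```python
import math

def likelihood_risk_improvement(n, p):
    # Expression (19): -sum_{j=1}^p log(1 + (p - 2j + 1)/n); requires n > p - 1.
    return -sum(math.log(1 + (p - 2 * j + 1) / n) for j in range(1, p + 1))

for n in (2, 3, 4):
    print(n, round(likelihood_risk_improvement(n, 2), 3))
```

The same function can be evaluated for larger $p$ to see how slowly the improvement grows relative to $1/n^2$.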

An obvious disadvantage of these estimators is that they depend on the coordinate system. Let $P_i$ be the $i$th permutation matrix, $i = 1, \ldots, p!$, and let $P_i A P_i' = T_i T_i'$, where $T_i$ is lower triangular and $t_{jj} > 0$, $j = 1, \ldots, p$. Then a randomized estimator that does not depend on the numbering of coordinates is to let the estimator be $P_i' T_i D T_i' P_i$ with probability $1/p!$; this estimator has the same risk as the estimator for the original numbering of coordinates. Since the loss functions are convex, $(1/p!)\sum_i P_i' T_i D T_i' P_i$ will have at least as good a risk function; in this case the risk will depend on $\Sigma$.

Haff (1980) has shown that $G(A) = [1/(n + p + 1)](A + \gamma u C)$, where $\gamma$ is constant, $0 < \gamma < 2(p - 1)/(n - p + 3)$, $u = 1/\operatorname{tr}(A^{-1}C)$ and $C$ is an arbitrary positive definite matrix, has a smaller quadratic risk than $[1/(n + p + 1)]A$. The estimator $G(A) = (1/n)[A + u\,t(u)C]$, where $t(u)$ is an absolutely continuous, nonincreasing function, $0 < t(u) < 2(p - 1)/n$, has a smaller likelihood risk than $S$.


7.9. ELLIPTICALLY CONTOURED DISTRIBUTIONS 
7.9.1. Observations Elliptically Contoured 

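Consider $x_1, \ldots, x_N$ observations on a random vector $X$ with density

(1) $$|\Lambda|^{-\frac12} g[(x - \nu)'\Lambda^{-1}(x - \nu)].$$

Let $A = \sum_{\alpha=1}^N (x_\alpha - \bar x)(x_\alpha - \bar x)'$, $n = N - 1$, $S = (1/n)A$. Then $S \xrightarrow{p} \Sigma$ as $N \to \infty$. The limiting normal distribution of $\sqrt N \operatorname{vec}(S - \Sigma)$ was given in Theorem 3.6.2.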

The lower triangular matrix $T$, satisfying $A = TT'$, was used in Section 7.2 in deriving the distribution of $A$ and hence of $S$. Define the lower triangular matrix $\bar T$ by $S = \bar T\bar T'$, $\bar t_{ii} > 0$, $i = 1, \ldots, p$. Then $\bar T = (1/\sqrt n)T$. If $\Sigma = I$, then $S \xrightarrow{p} I$ and $\bar T \xrightarrow{p} I$, $\sqrt N(S - I)$ and $\sqrt N(\bar T - I)$ have limiting normal distributions, and

(2) $$\sqrt N(S - I) = \sqrt N(\bar T - I) + \sqrt N(\bar T - I)' + o_p(1).$$

That is, $\sqrt N(s_{ii} - 1) = 2\sqrt N(\bar t_{ii} - 1) + o_p(1)$, and $\sqrt N s_{ij} = \sqrt N \bar t_{ij} + o_p(1)$, $i > j$. When $\Sigma = I$, the set $\sqrt N(s_{11} - 1), \ldots, \sqrt N(s_{pp} - 1)$ and the set $\sqrt N s_{ij}$, $i > j$, are asymptotically independent; $\sqrt N s_{12}, \ldots, \sqrt N s_{p-1,p}$ are mutually asymptotically independent, each with variance $1 + \kappa$; the limiting variance of $\sqrt N(s_{ii} - 1)$ is $3\kappa + 2$; and the limiting covariance of $\sqrt N(s_{ii} - 1)$ and $\sqrt N(s_{jj} - 1)$, $i \ne j$, is $\kappa$.

Theorem 7.9.1. If $\Sigma = I_p$, the limiting distribution of $\sqrt N(\bar T - I_p)$ is normal with mean 0. The variance of a diagonal element is $(3\kappa + 2)/4$; the covariance of two diagonal elements is $\kappa/4$; the variance of an off-diagonal element is $\kappa + 1$; the off-diagonal elements are uncorrelated and are uncorrelated with the diagonal elements.

Let $X = \nu + CY$, where $Y$ has the density $g(y'y)$, $\Lambda = CC'$, and $\Sigma = \mathscr{E}(X - \nu)(X - \nu)' = (\mathscr{E}R^2/p)\Lambda = \Gamma\Gamma'$, and $C$ and $\Gamma$ are lower triangular. Let $S$ be the sample covariance of a sample of $N$ on $X$. Let $S = \bar T\bar T'$. Then $S \xrightarrow{p} \Sigma$, $\bar T \xrightarrow{p} \Gamma$, and

(3) $$\sqrt N(S - \Sigma) = \sqrt N(\bar T - \Gamma)\Gamma' + \Gamma\sqrt N(\bar T - \Gamma)' + o_p(1).$$

The limiting distribution of $\sqrt N(\bar T - \Gamma)$ is normal, and the covariance can be calculated from (3) and the covariances of the elements of $\sqrt N(S - \Sigma)$. Since the primary interest in $\bar T$ is to find the distribution of $S$, we do not pursue this further here.


7.9.2. Elliptically Contoured Matrix Distributions 
Let $X$ ($N \times p$) have the density

(4) $$|C|^{-N} g[C^{-1}(X - \varepsilon_N\nu')'(X - \varepsilon_N\nu')(C')^{-1}]$$

based on the left spherical density $g(Y'Y)$.


Theorem 7.9.2. Define $T = (t_{ij})$ by $Y'Y = TT'$, $t_{ij} = 0$, $i < j$, and $t_{ii} > 0$. If the density of $Y$ is $g(Y'Y)$, then the density of $T$ is

(5) $$\frac{2^p \pi^{\frac12 Np - \frac14 p(p-1)}}{\prod_{i=1}^p \Gamma[\frac12(N + 1 - i)]} \prod_{i=1}^p t_{ii}^{N-i}\, g(TT').$$


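Proof. Let $Y = (v_1, \ldots, v_p)$. Define $w_i$ and $u_i$ recursively by $w_1 = v_1$, $u_1 = w_1/\|w_1\|$,

(6) $$w_i = v_i - \sum_{j=1}^{i-1} (u_j'v_i)u_j,$$

and $u_i = w_i/\|w_i\|$. Then $w_i'w_j = 0$, $u_i'u_j = 0$, $i \ne j$, and $u_i'u_i = 1$. Conditional on $v_1, \ldots, v_{i-1}$ (that is, $w_1, \ldots, w_{i-1}$), let $Q_i$ be an orthogonal matrix with $u_1', \ldots, u_{i-1}'$ as the first $i - 1$ rows; that is,

(7) $$Q_i = (u_1, \ldots, u_{i-1}, Q_i^{*\prime})'.$$

(See Lemma A.4.2.) Define

(8) $$z_i = Q_i v_i = (t_{i1}, \ldots, t_{i,i-1}, z_i^{*\prime})'.$$

This transformation of $v_i$ is linear and has Jacobian 1. The vector $z_i^*$ has $N + 1 - i$ components. Note that $\|z_i^*\|^2 = \|w_i\|^2$,

(9) $$v_i = \sum_{j=1}^{i-1} t_{ij}u_j + w_i,$$

(10) $$v_i'v_i = \sum_{j=1}^{i-1} t_{ij}^2 + z_i^{*\prime}z_i^* = \sum_{j=1}^{i} t_{ij}^2,$$

(11) $$v_i'v_j = \sum_{k=1}^{j} t_{ik}t_{jk}, \qquad j < i.$$

The transformation from $(v_1, \ldots, v_p)$ to $z_1, \ldots, z_p$ has Jacobian 1.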

To obtain the density of $T$ convert $z_i^*$ to polar coordinates and integrate with respect to the angular coordinates. (See Section 2.7.1.) ■

The above proof follows the lines of the proof of (6) in Section 7.2, but does not use information about the normal distribution, such as $t_{ii}^2 \overset{d}{=} \chi^2_{N+1-i}$.

See also Fang and Zhang (1990), Theorem 3.4.1.

Let $C$ be a lower triangular matrix such that $\Lambda = CC'$. Define $X = YC'$.

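Theorem 7.9.3. If $X$ ($N \times p$) has the density

(12) $$|C|^{-N} g[C^{-1}X'X(C')^{-1}],$$

then the lower triangular matrix $T^*$ satisfying $X'X = T^*T^{*\prime}$ and $t_{ii}^* > 0$ has the density

(13) $$\frac{2^p \pi^{\frac12 Np - \frac14 p(p-1)}}{\prod_{i=1}^p \Gamma[\frac12(N + 1 - i)]}\, |C|^{-N} \prod_{i=1}^p (t_{ii}^*)^{N-i}\, g[C^{-1}T^*T^{*\prime}(C')^{-1}].$$

Let $A = X'X = T^*T^{*\prime}$.

Theorem 7.9.4. If $X$ has the density (12), then $A = X'X$ has the density

(14) $$\frac{\pi^{\frac12 Np - \frac14 p(p-1)}}{\prod_{i=1}^p \Gamma[\frac12(N + 1 - i)]}\, |C|^{-N} |A|^{\frac12(N - p - 1)}\, g[C^{-1}A(C')^{-1}].$$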


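The class of densities $g(\operatorname{tr} Y'Y)$ is a subclass of densities $g(Y'Y)$. Let $X = \varepsilon_N\nu' + YC'$. Then the density of $X$ is

(15) $$|\Lambda|^{-\frac12 N} g[\operatorname{tr}(X - \varepsilon_N\nu')\Lambda^{-1}(X - \varepsilon_N\nu')'].$$

A stochastic representation of $X$ is $\operatorname{vec} X \overset{d}{=} R(C \otimes I_N)\operatorname{vec} U + \nu \otimes \varepsilon_N$. Theorems 7.9.3 and 7.9.4 can be specialized to this form. Then Theorem 3.6.5 holds.


Theorem 7.9.5. Let $X$ have the density (12) where $\Lambda$ is diagonal. Let $S = (N - 1)^{-1}(X - \varepsilon_N\bar x')'(X - \varepsilon_N\bar x')$ and $R = (\operatorname{diag} S)^{-\frac12} S (\operatorname{diag} S)^{-\frac12}$. Then the density of $R$ is (9) of Section 7.6.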


PROBLEMS 

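7.1. (Sec. 7.2) A transformation from rectangular to polar coordinates is

$$\begin{aligned} y_1 &= w\sin\theta_1, \\ y_2 &= w\cos\theta_1\sin\theta_2, \\ y_3 &= w\cos\theta_1\cos\theta_2\sin\theta_3, \\ &\;\;\vdots \\ y_{n-1} &= w\cos\theta_1\cos\theta_2\cdots\cos\theta_{n-2}\sin\theta_{n-1}, \\ y_n &= w\cos\theta_1\cos\theta_2\cdots\cos\theta_{n-2}\cos\theta_{n-1}, \end{aligned}$$

where $-\frac12\pi < \theta_i \le \frac12\pi$, $i = 1, \ldots, n - 2$, $-\pi < \theta_{n-1} \le \pi$, and $0 \le w < \infty$.

(a) Prove $w^2 = \sum_{i=1}^n y_i^2$. [Hint: Compute in turn $y_n^2 + y_{n-1}^2$, $(y_n^2 + y_{n-1}^2) + y_{n-2}^2$, and so forth.]

(b) Show that the Jacobian is $w^{n-1}\cos^{n-2}\theta_1\cos^{n-3}\theta_2\cdots\cos\theta_{n-2}$. [Hint: Prove

$$\left|\frac{\partial(y_1, \ldots, y_n)}{\partial(\theta_1, \ldots, \theta_{n-1}, w)}\right| \cdot \begin{vmatrix} \cos\theta_1 & 0 & \cdots & 0 & 0 \\ 0 & \cos\theta_2 & \cdots & 0 & 0 \\ \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & \cdots & \cos\theta_{n-1} & 0 \\ w\sin\theta_1 & w\sin\theta_2 & \cdots & w\sin\theta_{n-1} & 1 \end{vmatrix} = \begin{vmatrix} w & x & \cdots & x & x \\ 0 & w\cos\theta_1 & \cdots & x & x \\ \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & \cdots & w\cos\theta_1\cdots\cos\theta_{n-2} & x \\ 0 & 0 & \cdots & 0 & \cos\theta_1\cdots\cos\theta_{n-1} \end{vmatrix},$$

where $x$ denotes elements whose explicit values are not needed.]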

7.2. (Sec. 7.2) Prove that

$$\int_{-\pi/2}^{\pi/2} \cos^{h-1}\theta\, d\theta = \frac{\Gamma(\frac12 h)\Gamma(\frac12)}{\Gamma[\frac12(h + 1)]}.$$

[Hint: Let $\cos^2\theta = u$, and use the definition of $B(p, q)$.]


7.3. (Sec. 7.2) Use Problems 7.1 and 7.2 to prove that the surface area of a sphere of unit radius in $n$ dimensions is

$$C(n) = \frac{2\pi^{\frac12 n}}{\Gamma(\frac12 n)}.$$

7.4. (Sec. 7.2) Use Problems 7.1, 7.2, and 7.3 to prove that if the density of $y' = (y_1, \ldots, y_n)$ is $f(y'y)$, then the density of $u = y'y$ is $\frac12 C(n)f(u)u^{\frac12 n - 1}$.

7.5. (Sec. 7.2) $\chi^2$-distribution. Use Problem 7.4 to show that if $y_1, \ldots, y_n$ are independently distributed, each according to $N(0, 1)$, then $U = \sum_{\alpha=1}^n y_\alpha^2$ has the density $u^{\frac12 n - 1}e^{-\frac12 u}/[2^{\frac12 n}\Gamma(\frac12 n)]$, which is the $\chi^2$-density with $n$ degrees of freedom.

7.6. (Sec. 7.2) Use (9) of Section 7.6 to derive the distribution of $A$.

7.7. (Sec. 7.2) Use the proof of Theorem 7.2.1 to demonstrate $\Pr\{|A| = 0\} = 0$.

7.8. (Sec. 7.2) Independence of estimators of the parameters of the complex normal distribution. Let $z_1, \ldots, z_N$ be $N$ observations from the complex normal distribution with mean $\theta$ and covariance matrix $P$. (See Problem 2.64.) Show that $\bar z$ and $A = \sum_{\alpha=1}^N (z_\alpha - \bar z)(z_\alpha - \bar z)^*$ are independently distributed, and show that $A$ has the distribution of $\sum_{\alpha=1}^n W_\alpha W_\alpha^*$, where $W_1, \ldots, W_n$ are independently distributed, each according to the complex normal distribution with mean 0 and covariance matrix $P$.

7.9. (Sec. 7.2) The complex Wishart distribution. Let $W_1, \ldots, W_n$ be independently distributed, each according to the complex normal distribution with mean 0 and covariance matrix $P$. (See Problem 2.64.) Show that the density of $B = \sum_{\alpha=1}^n W_\alpha W_\alpha^*$ is

7.10. (Sec. 7.3) Find the characteristic function of $A$ from $W(\Sigma, n)$. [Hint: From $\int w(A \mid \Sigma, n)\, dA = 1$ one derives

$$\int |A|^{\frac12(n - p - 1)} \exp(-\tfrac12 \operatorname{tr}\Phi^{-1}A)\, dA = 2^{\frac12 pn}|\Phi|^{\frac12 n}\Gamma_p(\tfrac12 n)$$

as an identity in $\Phi$.] Note that comparison of this result with that of Section 7.3.1 is a proof of the Wishart distribution.

7.11. (Sec. 7.3.2) Prove Theorem 7.3.2 by use of characteristic functions.

7.12. (Sec. 7.3.1) Find the first two moments of the elements of $A$ by differentiating the characteristic function (11).

7.13. (Sec. 7.3) Let $Z_1, \ldots, Z_n$ be independently distributed, each according to $N(0, I)$. Let $W = \sum_{\alpha,\beta=1}^n b_{\alpha\beta}Z_\alpha Z_\beta'$. Prove that if $a'Wa = \chi^2_m$ for all $a$ such that $a'a = 1$, then $W$ is distributed according to $W(I, m)$. [Hint: Use the characteristic function of $a'Wa$.]

7.14. (Sec. 7.4) Let $x_\alpha$ be an observation from $N(\beta z_\alpha, \Sigma)$, $\alpha = 1, \ldots, N$, where $z_\alpha$ is a scalar. Let $b = \sum_\alpha z_\alpha x_\alpha / \sum_\alpha z_\alpha^2$. Use Theorem 7.4.1 to show that $\sum_\alpha x_\alpha x_\alpha' - bb'\sum_\alpha z_\alpha^2$ and $bb'$ are independent.

7.15. (Sec. 7.4) Show that

$$\mathscr{E}(\chi^2_{N-1}\chi^2_{N-2})^h = \mathscr{E}[(\chi^2_{2N-4})^2/4]^h, \qquad h \ge 0,$$

by use of the duplication formula for the gamma function; $\chi^2_{N-1}$ and $\chi^2_{N-2}$ are independent. Hence show that the distribution of $\chi^2_{N-1}\chi^2_{N-2}$ is the distribution of $(\chi^2_{2N-4})^2/4$.

7.16. (Sec. 7.4) Verify that Theorem 7.4.1 follows from Lemma 7.4.1. [Hint: Prove that $Q_i$ having the distribution $W(\Sigma, r_i)$ implies the existence of (6) where $I$ is of order $r_i$, and that the independence of the $Q_i$'s implies that the $I$'s in (6) do not overlap.]

7.17. (Sec. 7.5) Find $\mathscr{E}|A|^h$ directly from $W(\Sigma, n)$. [Hint: The fact that

$$\int w(A \mid \Sigma, n)\, dA \equiv 1$$

shows

$$\int |A|^{\frac12(n - p - 1)} \exp(-\tfrac12 \operatorname{tr}\Sigma^{-1}A)\, dA = 2^{\frac12 pn}|\Sigma|^{\frac12 n}\Gamma_p(\tfrac12 n)$$

as an identity in $n$.]

7.18. (Sec. 7.5) Consider the confidence region for $\mu$ given by

$$N(\bar x - \mu^*)'S^{-1}(\bar x - \mu^*) \le T^2_{p,N-1}(\alpha),$$

where $\bar x$ and $S$ are based on a sample of $N$ from $N(\mu, \Sigma)$. Find the expected value of the volume of the confidence region.

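7.19. (Sec. 7.6) Prove that if $\Sigma = I$, the joint density of $r_{ij\cdot p}$, $i, j = 1, \ldots, p - 1$, and $r_{1p}, \ldots, r_{p-1,p}$ is

$$\frac{\Gamma^{p-1}[\frac12(n - 1)]}{\Gamma_{p-1}[\frac12(n - 1)]}\, |R_{11\cdot p}|^{\frac12(n - p - 1)} \prod_{i=1}^{p-1} \frac{\Gamma(\frac12 n)}{\pi^{\frac12}\Gamma[\frac12(n - 1)]}\,(1 - r_{ip}^2)^{\frac12(n - 3)},$$

where $R_{11\cdot p} = (r_{ij\cdot p})$. [Hint: $r_{ij\cdot p} = (r_{ij} - r_{ip}r_{jp})/(\sqrt{1 - r_{ip}^2}\sqrt{1 - r_{jp}^2})$ and $|r_{ij}| = |r_{ij\cdot p}|\prod_{i=1}^{p-1}(1 - r_{ip}^2)$. Use (9).]

7.20. (Sec. 7.6) Prove that the joint density of $r_{12\cdot 3,\ldots,p}$, $r_{13\cdot 4,\ldots,p}$, $r_{23\cdot 4,\ldots,p}$, $\ldots$, $r_{1p}, \ldots, r_{p-1,p}$ is

$$\frac{\Gamma\{\frac12[n - (p - 2)]\}}{\pi^{\frac12}\Gamma\{\frac12[n - (p - 1)]\}}\,(1 - r^2_{12\cdot 3,\ldots,p})^{\frac12[n - (p + 1)]} \cdots \prod_{i=1}^{p-2} \frac{\Gamma[\frac12(n - 1)]}{\pi^{\frac12}\Gamma[\frac12(n - 2)]}\,(1 - r^2_{i,p-1\cdot p})^{\frac12(n - 4)} \prod_{i=1}^{p-1} \frac{\Gamma(\frac12 n)}{\pi^{\frac12}\Gamma[\frac12(n - 1)]}\,(1 - r^2_{ip})^{\frac12(n - 3)}.$$

[Hint: Use the result of Problem 7.19 inductively.]

7.21. (Sec. 7.6) Prove (without the use of Problem 7.20) that if $\Sigma = I$, then $r_{1p}, \ldots, r_{p-1,p}$ are independently distributed. [Hint: $r_{ip} = a_{ip}/(\sqrt{a_{ii}}\sqrt{a_{pp}})$. Prove that the pairs $(a_{1p}, a_{11}), \ldots, (a_{p-1,p}, a_{p-1,p-1})$ are independent when $(z_{1p}, \ldots, z_{np})$ are fixed, and note from Section 4.2.1 that the marginal distribution of $r_{ip}$, conditional on $z_{\alpha p}$, does not depend on $z_{\alpha p}$.]

7.22. (Sec. 7.6) Prove (without the use of Problems 7.19 and 7.20) that if $\Sigma = I$, then the set $r_{1p}, \ldots, r_{p-1,p}$ is independent of the set $r_{ij\cdot p}$, $i, j = 1, \ldots, p - 1$. [Hint: From Section 4.3.2 $a_{pp}$ and $(a_{ip})$ are independent of $(a_{ij\cdot p})$. Prove that $a_{pp}$, $(a_{ip})$, and $a_{ii}$, $i = 1, \ldots, p - 1$, are independent of $(r_{ij\cdot p})$ by proving that $a_{ii\cdot p}$ are independent of $(r_{ij\cdot p})$. See Problem 4.21.]

7.23. (Sec. 7.6) Prove the conclusion of Problem 7.20 by using Problems 7.21 and 7.22.

7.24. (Sec. 7.6) Reverse the steps in Problem 7.20 to derive (9) of Section 7.6.

7.25. (Sec. 7.6) Show that when $p = 3$ and $\Sigma$ is diagonal $r_{12}$, $r_{13}$, $r_{23}$ are not mutually independent.

7.26. (Sec. 7.6) Show that when $\Sigma$ is diagonal the set $r_{ij}$ are pairwise independent.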

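7.27. (Sec. 7.7) Multivariate t-distribution. Let $y$ and $u$ be independently distributed according to $N(0, \Sigma)$ and the $\chi^2_n$-distribution, respectively, and let $\sqrt{n/u}\, y = x - \mu$.

(a) Show that the density of $x$ is

$$\frac{\Gamma[\frac12(n + p)]}{\Gamma(\frac12 n)\, n^{\frac12 p}\pi^{\frac12 p}|\Sigma|^{\frac12}\bigl[1 + \frac1n (x - \mu)'\Sigma^{-1}(x - \mu)\bigr]^{\frac12(n + p)}}.$$

(b) Show that $\mathscr{E}x = \mu$ and $\mathscr{E}(x - \mu)(x - \mu)' = [n/(n - 2)]\Sigma$.

7.28. (Sec. 7.8) Prove that $\Phi$ is not proportional to $I$ by calculating $\Phi$.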

7.29. (Sec. 7.8) Prove for $p = 2$

0 ' 

TDT f {d 2 -d A ) Q L4[. 

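7.30. (Sec. 7.8) Verify (17) and (18). [Hint: To verify (18) let $\Sigma = KK'$, $A = KA^*K'$, and $A^* = T^*T^{*\prime}$, where $K$ and $T^*$ are lower triangular.]

7.31. (Sec. 7.8) Prove for optimal $D$

$$\mathscr{E}_\Sigma L_l(\Sigma, S) - \mathscr{E}_\Sigma L_l(\Sigma, TDT') = \begin{cases} -\displaystyle\sum_{i=1}^{\frac12 p} \log\Bigl[1 - \Bigl(\frac{p - 2i + 1}{n}\Bigr)^2\Bigr], & p \text{ even}, \\[2ex] -\displaystyle\sum_{i=1}^{\frac12(p - 1)} \log\Bigl[1 - \Bigl(\frac{p - 2i + 1}{n}\Bigr)^2\Bigr], & p \text{ odd}. \end{cases}$$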

7.32. (Sec. 7.8) Prove $L_q(\Sigma, G)$ and $L_l(\Sigma, G)$ are invariant with respect to transformations $G^* = CGC'$, $\Sigma^* = C\Sigma C'$ for $C$ nonsingular.


7.33. (Sec. 7.8) Prove $L_q(\Sigma, G)$ is a multiple of $(g - \sigma)'\Phi^{-1}(g - \sigma)$. Hint: Transform so $\Sigma = I$. Then show

$$\Phi = \frac{1}{n}\begin{pmatrix} 2I & 0 \\ 0 & I \end{pmatrix}.$$

7.34. (Sec. 7.8) Verify (11).

7.35. Let the density of $Y$ be $f(y) = K$ for $y'y \le p + 2$ and 0 elsewhere. Prove that $K = \Gamma(\frac12 p + 1)/[(p + 2)\pi]^{\frac12 p}$, and show that $\mathscr{E}Y = 0$ and $\mathscr{E}YY' = I$.


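7.36. (Sec. 7.2) Dirichlet distribution. Let $Y_1, \ldots, Y_m$ be independently distributed as $\chi^2$-variables with $p_1, \ldots, p_m$ degrees of freedom, respectively. Define $Z_i = Y_i/\sum_{j=1}^m Y_j$, $i = 1, \ldots, m$. Show that the density of $Z_1, \ldots, Z_{m-1}$ is

$$\frac{\Gamma(\sum_{i=1}^m \frac12 p_i)}{\prod_{i=1}^m \Gamma(\frac12 p_i)}\, z_1^{\frac12 p_1 - 1} \cdots z_{m-1}^{\frac12 p_{m-1} - 1}\Bigl(1 - \sum_{i=1}^{m-1} z_i\Bigr)^{\frac12 p_m - 1}$$

for $z_i \ge 0$, $i = 1, \ldots, m$, where $z_m = 1 - \sum_{i=1}^{m-1} z_i$.

7.37. (Sec. 7.5) Show that if $\chi^2_{N-1}$ and $\chi^2_{N-2}$ are independently distributed, then $\chi^2_{N-1}\chi^2_{N-2}$ is distributed as $(\chi^2_{2N-4})^2/4$. [Hint: In the joint density of $x = \chi^2_{N-1}$ and $y = \chi^2_{N-2}$ substitute $z = 2\sqrt{xy}$, $x = x$, and express the marginal density of $z$ as $z^{N-3}h(z)$, where $h(z)$ is an integral with respect to $x$. Find $h'(z)$, and solve the differential equation. See Srivastava and Khatri (1979), Chapter 3.]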



CHAPTER 8 


Testing the General Linear 
Hypothesis; Multivariate 
Analysis of Variance 


8.1. INTRODUCTION

In this chapter we generalize the univariate least squares theory (i.e., regression analysis) and the analysis of variance to vector variates. The algebra of the multivariate case is essentially the same as that of the univariate case. This leads to distribution theory that is analogous to that of the univariate case and to test criteria that are analogs of F-statistics. In fact, given a univariate test, we shall be able to write down immediately a corresponding multivariate test. Since the analysis of variance based on the model of fixed effects can be obtained from least squares theory, we obtain directly a theory of multivariate analysis of variance. However, in the multivariate case there is more latitude in the choice of tests of significance.

In univariate least squares we consider scalar dependent variates $x_1, \ldots, x_N$ drawn from populations with expected values $\beta'z_1, \ldots, \beta'z_N$, respectively, where $\beta$ is a column vector of $q$ components and each of the $z_\alpha$ is a column vector of $q$ known components. Under the assumption that the variances in the populations are the same, the least squares estimator of $\beta$ is

(1) $$b = \Bigl(\sum_{\alpha=1}^N z_\alpha z_\alpha'\Bigr)^{-1}\sum_{\alpha=1}^N z_\alpha x_\alpha.$$

If the populations are normal, the vector is the maximum likelihood estimator of $\beta$. The unbiased estimator of the common variance $\sigma^2$ is

(2) $$s^2 = \sum_{\alpha=1}^N (x_\alpha - b'z_\alpha)^2/(N - q),$$

and under the assumption of normality, the maximum likelihood estimator of $\sigma^2$ is $\hat\sigma^2 = (N - q)s^2/N$.

In the multivariate case $x_\alpha$ is a vector, $\beta'$ is replaced by a matrix $\beta$, and $\sigma^2$ is replaced by a covariance matrix $\Sigma$. The estimators of $\beta$ and $\Sigma$, given in Section 8.2, are matric analogs of (1) and (2).

To test a hypothesis concerning $\beta$, say the hypothesis $\beta = 0$, we use an F-test. A criterion equivalent to the F-ratio is

(3) $$\frac{\hat\sigma^2}{\hat\sigma_0^2} = \frac{1}{[q/(N - q)]F + 1},$$

where $\hat\sigma_0^2$ is the maximum likelihood estimator of $\sigma^2$ under the null hypothesis. We shall find that the likelihood ratio criterion for the corresponding multivariate hypothesis, say $\beta = 0$, is the above with the variances replaced by generalized variances. The distribution of the likelihood ratio criterion under the null hypothesis is characterized, the moments are found, and some specific distributions obtained. Satisfactory approximations are given as well as tables of significance points (Appendix B).

The hypothesis testing problem is invariant under several groups of linear transformations. Other invariant criteria are treated, including the Lawley-Hotelling trace, the Bartlett-Nanda-Pillai trace, and the Roy maximum root criteria. Some comparison of power is made.

Confidence regions or simultaneous confidence intervals for elements of $\beta$ can be based on the likelihood ratio test, the Lawley-Hotelling trace test, and the Roy maximum root test. Procedures are given explicitly for several problems of the analysis of variance. Optimal properties of admissibility, unbiasedness, and monotonicity of power functions are studied. Finally, the theory and methods are extended to elliptically contoured distributions.


8.2. ESTIMATORS OF PARAMETERS IN MULTIVARIATE
LINEAR REGRESSION

8.2.1. Maximum Likelihood Estimators; Least Squares Estimators

Suppose $x_1, \ldots, x_N$ are a set of $N$ independent observations, $x_\alpha$ being drawn from $N(\beta z_\alpha, \Sigma)$. Ordinarily the vectors $z_\alpha$ (with $q$ components) are known vectors, and the $p \times p$ matrix $\Sigma$ and the $p \times q$ matrix $\beta$ are unknown. We assume $N \ge p + q$ and the rank of

(1) $$Z = (z_1, \ldots, z_N)$$

is $q$. We shall estimate $\Sigma$ and $\beta$ by the method of maximum likelihood. The likelihood function is

(2) $$L = (2\pi)^{-\frac12 pN}|\Sigma^*|^{-\frac12 N} \exp\Bigl[-\tfrac12 \sum_{\alpha=1}^N (x_\alpha - \beta^*z_\alpha)'\Sigma^{*-1}(x_\alpha - \beta^*z_\alpha)\Bigr].$$

In (2) the elements of $\Sigma^*$ and $\beta^*$ are indeterminates. The method of maximum likelihood specifies the estimators of $\Sigma$ and $\beta$ based on the given sample $x_1, z_1, \ldots, x_N, z_N$ as the $\Sigma^*$ and $\beta^*$ that maximize (2). It is convenient to use the following lemma.

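Lemma 8.2.1. Let

(3) $$B = \sum_{\alpha=1}^N x_\alpha z_\alpha' \Bigl(\sum_{\alpha=1}^N z_\alpha z_\alpha'\Bigr)^{-1}.$$

Then for any $p \times q$ matrix $F$

(4) $$\sum_{\alpha=1}^N (x_\alpha - Fz_\alpha)(x_\alpha - Fz_\alpha)' = \sum_{\alpha=1}^N (x_\alpha - Bz_\alpha)(x_\alpha - Bz_\alpha)' + (B - F)\sum_{\alpha=1}^N z_\alpha z_\alpha'\,(B - F)'.$$

Proof. The left-hand side of (4) is

(5) $$\sum_{\alpha=1}^N [(x_\alpha - Bz_\alpha) + (B - F)z_\alpha][(x_\alpha - Bz_\alpha) + (B - F)z_\alpha]',$$

which is equal to the right-hand side of (4) because

(6) $$\sum_{\alpha=1}^N z_\alpha(x_\alpha - Bz_\alpha)' = 0$$

by virtue of (3). ■

The exponential in $L$ is $-\frac12$ times

(7) $$\operatorname{tr}\Sigma^{*-1}\sum_{\alpha=1}^N (x_\alpha - \beta^*z_\alpha)(x_\alpha - \beta^*z_\alpha)' = \operatorname{tr}\Sigma^{*-1}\sum_{\alpha=1}^N (x_\alpha - Bz_\alpha)(x_\alpha - Bz_\alpha)' + \operatorname{tr}\Sigma^{*-1}(B - \beta^*)A(B - \beta^*)',$$

where

(8) $$A = \sum_{\alpha=1}^N z_\alpha z_\alpha'.$$

The likelihood is maximized with respect to $\beta^*$ by minimizing the last term in (7).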

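Lemma 8.2.2. If $A$ and $G$ are positive definite, $\operatorname{tr} FAF'G > 0$ for $F \ne 0$.

Proof. Let $A = HH'$, $G = KK'$. Then

(9) $$\operatorname{tr} FAF'G = \operatorname{tr} FHH'F'KK' = \operatorname{tr} K'FHH'F'K = \operatorname{tr}(K'FH)(K'FH)' > 0$$

for $F \ne 0$ because then $K'FH \ne 0$ since $H$ and $K$ are nonsingular. ■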

It follows from (7) and the lemma that $L$ is maximized with respect to $\beta^*$ by $\beta^* = B$, that is,

(10) $$\hat\beta = CA^{-1},$$

where

(11) $$C = \sum_{\alpha=1}^N x_\alpha z_\alpha'.$$

Then by Lemma 3.2.2, $L$ is maximized with respect to $\Sigma^*$ at

(12) $$\hat\Sigma = \frac1N \sum_{\alpha=1}^N (x_\alpha - \hat\beta z_\alpha)(x_\alpha - \hat\beta z_\alpha)'.$$

This is the multivariate analog of $\hat\sigma^2 = (N - q)s^2/N$ defined by (2) of Section 8.1.

Theorem 8.2.1. If $x_\alpha$ is an observation from $N(\beta z_\alpha, \Sigma)$, $\alpha = 1, \ldots, N$, with $(z_1, \ldots, z_N)$ of rank $q$, the maximum likelihood estimator of $\beta$ is given by (10), where $C = \sum_\alpha x_\alpha z_\alpha'$ and $A = \sum_\alpha z_\alpha z_\alpha'$. The maximum likelihood estimator of $\Sigma$ is given by (12).

A useful algebraic result follows from (12) and (4) with $F = 0$:

(13) $$N\hat\Sigma = \sum_{\alpha=1}^N x_\alpha x_\alpha' - \hat\beta A\hat\beta'.$$

Now let us consider a geometric interpretation of the estimation procedure. Let the $i$th row of $(x_1, \ldots, x_N)$ be $x_i^*$ (with $N$ components) and the $i$th row of $(z_1, \ldots, z_N)$ be $z_i^*$ (with $N$ components). Then $\sum_j \hat\beta_{ij}z_j^*$, being a linear combination of the vectors $z_1^*, \ldots, z_q^*$, is a vector in the $q$-space spanned by $z_1^*, \ldots, z_q^*$, and is in fact, of all such vectors, the one nearest to $x_i^*$; hence, it is the projection of $x_i^*$ on the $q$-space. Thus $x_i^* - \sum_j \hat\beta_{ij}z_j^*$ is the vector orthogonal to the $q$-space going from the projection of $x_i^*$ on the $q$-space to $x_i^*$. Translate this vector so that one endpoint is at the origin. Then the set of $p$ vectors $x_1^* - \sum_j \hat\beta_{1j}z_j^*, \ldots, x_p^* - \sum_j \hat\beta_{pj}z_j^*$ is a set of vectors emanating from the origin. $N\hat\sigma_{ii} = (x_i^* - \sum_j \hat\beta_{ij}z_j^*)(x_i^* - \sum_j \hat\beta_{ij}z_j^*)'$ is the square of the length of the $i$th such vector, and $N\hat\sigma_{ij} = (x_i^* - \sum_h \hat\beta_{ih}z_h^*)(x_j^* - \sum_g \hat\beta_{jg}z_g^*)'$ is the product of the length of the $i$th vector, the length of the $j$th vector, and the cosine of the angle between them.

The equations defining the maximum likelihood estimator of $\beta$, namely, $A\hat\beta' = C'$, consist of $p$ sets of $q$ linear equations in $q$ unknowns. Each set can be solved by the method of pivotal condensation or successive elimination (Section A.5 of the Appendix). The forward solutions are the same (except the right-hand sides) for all sets. Use of (13) to compute $N\hat\Sigma$ involves an efficient computation of $\hat\beta A\hat\beta'$.

Let $x_\alpha = (x_{1\alpha}, \ldots, x_{p\alpha})'$, $B = (b_1, \ldots, b_p)'$, and $\beta = (\beta_1, \ldots, \beta_p)'$. Then $\mathscr{E}x_{i\alpha} = \beta_i'z_\alpha$, and $b_i$ is the least squares estimator of $\beta_i$. If $G$ is a positive definite matrix, then $\operatorname{tr} G\sum_{\alpha=1}^N (x_\alpha - Fz_\alpha)(x_\alpha - Fz_\alpha)'$ is minimized by $F = B$. This is another sense in which $B$ is the least squares estimator.
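
The estimators (10) and (12) and the shortcut (13) can be illustrated numerically. The following is a minimal sketch with $p = 2$, $q = 2$, and $N = 5$; the data and helper functions are hypothetical and use only plain Python lists.

```python
def matmul(a, b):
    # Product of an m x k matrix and a k x n matrix stored as nested lists.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

# Hypothetical data: columns of X are the x_alpha, columns of Z are the z_alpha.
X = [[1.0, 2.0, 2.5, 4.0, 5.5],
     [0.5, 1.0, 2.0, 2.0, 3.5]]
Z = [[1.0, 1.0, 1.0, 1.0, 1.0],
     [0.0, 1.0, 2.0, 3.0, 4.0]]
N = len(X[0])

A = matmul(Z, transpose(Z))       # A = sum of z z'
C = matmul(X, transpose(Z))       # C = sum of x z'
B = matmul(C, inverse_2x2(A))     # estimator of beta, equation (10)

# N times the estimator of Sigma computed from residuals, equation (12).
BZ = matmul(B, Z)
R = [[X[i][a] - BZ[i][a] for a in range(N)] for i in range(len(X))]
NS_resid = matmul(R, transpose(R))

# The same matrix from the shortcut (13): sum of x x' minus B A B'.
XX = matmul(X, transpose(X))
BAB = matmul(matmul(B, A), transpose(B))
NS_short = [[XX[i][j] - BAB[i][j] for j in range(2)] for i in range(2)]

print(B)
print(NS_resid)
print(NS_short)
```

The two versions of $N\hat\Sigma$ agree up to rounding, and the residual rows are orthogonal to the rows of $Z$, which is the geometric statement made above.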

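8.2.2. Distribution of $\hat\beta$ and $\hat\Sigma$

Now let us find the joint distribution of $\hat\beta_{ig}$ ($i = 1, \ldots, p$, $g = 1, \ldots, q$). The joint distribution is normal since the $\hat\beta_{ig}$ are linear combinations of the $X_{i\alpha}$. From (10) we see that

(14) $$\mathscr{E}\hat\beta = \mathscr{E}\sum_{\alpha=1}^N X_\alpha z_\alpha' A^{-1} = \sum_{\alpha=1}^N \beta z_\alpha z_\alpha' A^{-1} = \beta AA^{-1} = \beta.$$

Thus $\hat\beta$ is an unbiased estimator of $\beta$. The covariance between $\hat\beta_i'$ and $\hat\beta_j'$, two rows of $\hat\beta$, is

(15) $$\begin{aligned} \mathscr{E}(\hat\beta_i - \beta_i)(\hat\beta_j - \beta_j)' &= A^{-1}\mathscr{E}\sum_{\alpha=1}^N (X_{i\alpha} - \mathscr{E}X_{i\alpha})z_\alpha \sum_{\gamma=1}^N (X_{j\gamma} - \mathscr{E}X_{j\gamma})z_\gamma' A^{-1} \\ &= A^{-1}\sum_{\alpha,\gamma=1}^N \mathscr{E}(X_{i\alpha} - \mathscr{E}X_{i\alpha})(X_{j\gamma} - \mathscr{E}X_{j\gamma})z_\alpha z_\gamma' A^{-1} \\ &= A^{-1}\sum_{\alpha,\gamma=1}^N \delta_{\alpha\gamma}\sigma_{ij}z_\alpha z_\gamma' A^{-1} \\ &= A^{-1}\sum_{\alpha=1}^N \sigma_{ij}z_\alpha z_\alpha' A^{-1} \\ &= \sigma_{ij}A^{-1}AA^{-1} \\ &= \sigma_{ij}A^{-1}. \end{aligned}$$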

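To summarize, the vector of $pq$ components $(\hat\beta_1', \ldots, \hat\beta_p')' = \operatorname{vec}\hat\beta'$ is normally distributed with mean $(\beta_1', \ldots, \beta_p')' = \operatorname{vec}\beta'$ and covariance matrix

(16) $$\begin{pmatrix} \sigma_{11}A^{-1} & \sigma_{12}A^{-1} & \cdots & \sigma_{1p}A^{-1} \\ \sigma_{21}A^{-1} & \sigma_{22}A^{-1} & \cdots & \sigma_{2p}A^{-1} \\ \vdots & \vdots & & \vdots \\ \sigma_{p1}A^{-1} & \sigma_{p2}A^{-1} & \cdots & \sigma_{pp}A^{-1} \end{pmatrix}.$$

The matrix (16) is the Kronecker (or direct) product of the matrices $\Sigma$ and $A^{-1}$, denoted by $\Sigma \otimes A^{-1}$.

From Theorem 4.3.3 it follows that $N\hat\Sigma = \sum_{\alpha=1}^N x_\alpha x_\alpha' - \hat\beta A\hat\beta'$ is distributed according to $W(\Sigma, N - q)$. From this we see that an unbiased estimator of $\Sigma$ is $S = [N/(N - q)]\hat\Sigma$.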
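
The block structure of the covariance matrix (16) can be made concrete with a small Kronecker product routine. This is a sketch only; the numerical values of `Sigma` and `A_inv` are hypothetical.

```python
def kron(a, b):
    # Kronecker product: block (i, j) of the result is a[i][j] * b.
    rows_b, cols_b = len(b), len(b[0])
    return [[a[i][j] * b[k][l] for j in range(len(a[0])) for l in range(cols_b)]
            for i in range(len(a)) for k in range(rows_b)]

# Hypothetical values: Sigma is p x p (p = 2), A_inv is q x q (q = 2).
Sigma = [[2.0, 0.5], [0.5, 1.0]]
A_inv = [[0.2, -0.1], [-0.1, 0.3]]

# Covariance matrix of the stacked rows of the estimator of beta (pq x pq).
cov_vec = kron(Sigma, A_inv)

def row_cov(i, j):
    # Covariance matrix of rows i and j of the estimator: sigma_ij * A^{-1}.
    return [[Sigma[i][j] * v for v in row] for row in A_inv]

for row in cov_vec:
    print(row)
```

Block $(i, j)$ of `cov_vec` coincides with `row_cov(i, j)`, which is the statement $\mathscr{E}(\hat\beta_i - \beta_i)(\hat\beta_j - \beta_j)' = \sigma_{ij}A^{-1}$.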

Theorem 8.2.2. The maximum likelihood estimator $\hat\beta$ based on a set of $N$ observations, the $\alpha$th from $N(\beta z_\alpha, \Sigma)$, is normally distributed with mean $\beta$, and the covariance matrix of the $i$th and $j$th rows of $\hat\beta$ is $\sigma_{ij}A^{-1}$, where $A = \sum_\alpha z_\alpha z_\alpha'$. The maximum likelihood estimator $\hat\Sigma$ multiplied by $N$ is independently distributed according to $W(\Sigma, N - q)$, where $q$ is the number of components of $z_\alpha$.

The density then can be written [by virtue of (4)]

(17) $$(2\pi)^{-\frac12 pN}|\Sigma|^{-\frac12 N}\exp\bigl\{-\tfrac12 \operatorname{tr}\Sigma^{-1}[(\hat\beta - \beta)A(\hat\beta - \beta)' + N\hat\Sigma]\bigr\}.$$

This proves the following:

Corollary 8.2.1. $\hat\beta$ and $\hat\Sigma$ form a sufficient set of statistics for $\beta$ and $\Sigma$.

A useful theorem is the following. 

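Theorem 8.2.3. Let $X_\alpha$ be distributed according to $N(\beta z_\alpha, \Sigma)$, $\alpha = 1, \ldots, N$, and suppose $X_1, \ldots, X_N$ are independent.

(a) If $w_\alpha = Hz_\alpha$ and $\Gamma = \beta H^{-1}$, then $X_\alpha$ is distributed according to $N(\Gamma w_\alpha, \Sigma)$.

(b) The maximum likelihood estimator of $\Gamma$ based on observations $x_\alpha$ on $X_\alpha$, $\alpha = 1, \ldots, N$, is $\hat\Gamma = \hat\beta H^{-1}$, where $\hat\beta$ is the maximum likelihood estimator of $\beta$.

(c) $\hat\Gamma(\sum_\alpha w_\alpha w_\alpha')\hat\Gamma' = \hat\beta A\hat\beta'$, where $A = \sum_\alpha z_\alpha z_\alpha'$, and the maximum likelihood estimator of $N\Sigma$ is $N\hat\Sigma = \sum_\alpha x_\alpha x_\alpha' - \hat\Gamma(\sum_\alpha w_\alpha w_\alpha')\hat\Gamma' = \sum_\alpha x_\alpha x_\alpha' - \hat\beta A\hat\beta'$.

(d) $\hat\Gamma$ and $\hat\Sigma$ are independently distributed.

(e) $\hat\Gamma$ is normally distributed with mean $\Gamma$ and the covariance matrix of the $i$th and $j$th rows of $\hat\Gamma$ is $\sigma_{ij}(HAH')^{-1} = \sigma_{ij}H'^{-1}A^{-1}H^{-1}$.

The proof is left to the reader.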

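An estimator $F$ is a linear estimator of $\beta_{ig}$ if $F = \sum_{\alpha=1}^N f_\alpha'x_\alpha$. It is a linear unbiased estimator of $\beta_{ig}$ if

(18) $$\beta_{ig} = \mathscr{E}F = \mathscr{E}\sum_{\alpha=1}^N f_\alpha'x_\alpha = \sum_{\alpha=1}^N f_\alpha'\beta z_\alpha = \sum_{\alpha=1}^N \sum_{j=1}^p \sum_{h=1}^q f_{j\alpha}\beta_{jh}z_{h\alpha}$$

is an identity in $\beta$, that is, if

(19) $$\sum_{\alpha=1}^N f_{j\alpha}z_{h\alpha} = \begin{cases} 1, & j = i,\ h = g, \\ 0, & \text{otherwise}. \end{cases}$$

A linear unbiased estimator is best if it has minimum variance over all linear unbiased estimators; that is, if $\mathscr{E}(F - \beta_{ig})^2 \le \mathscr{E}(G - \beta_{ig})^2$ for $G = \sum_{\alpha=1}^N g_\alpha'x_\alpha$ and $\mathscr{E}G = \beta_{ig}$.

Theorem 8.2.4. The least squares estimator is the best linear unbiased estimator of $\beta_{ig}$.

Proof. Let $\tilde\beta_{ig} = \sum_{\alpha=1}^N \sum_{j=1}^p f_{j\alpha}x_{j\alpha}$ be an arbitrary unbiased estimator of $\beta_{ig}$, and let $\hat\beta_{ig} = \sum_{\alpha=1}^N \sum_{h=1}^q x_{i\alpha}z_{h\alpha}a^{hg}$ be the least squares estimator, where $A = \sum_{\alpha=1}^N z_\alpha z_\alpha' = (a_{hg})$ and $A^{-1} = (a^{hg})$. Then

(20) $$\mathscr{E}(\tilde\beta_{ig} - \beta_{ig})^2 = \mathscr{E}(\tilde\beta_{ig} - \hat\beta_{ig})^2 + 2\mathscr{E}(\tilde\beta_{ig} - \hat\beta_{ig})(\hat\beta_{ig} - \beta_{ig}) + \mathscr{E}(\hat\beta_{ig} - \beta_{ig})^2.$$

Because $\tilde\beta_{ig}$ and $\hat\beta_{ig}$ are unbiased, $\tilde\beta_{ig} - \beta_{ig} = \sum_{\alpha=1}^N \sum_{j=1}^p f_{j\alpha}u_{j\alpha}$, where $u_{j\alpha} = x_{j\alpha} - \mathscr{E}x_{j\alpha}$, $\hat\beta_{ig} - \beta_{ig} = \sum_{\alpha=1}^N \sum_{h=1}^q u_{i\alpha}z_{h\alpha}a^{hg}$, and

(21) $$\tilde\beta_{ig} - \hat\beta_{ig} = \sum_{\alpha=1}^N \sum_{j=1}^p \Bigl(f_{j\alpha} - \delta_{ij}\sum_{h=1}^q z_{h\alpha}a^{hg}\Bigr)u_{j\alpha},$$

where $\delta_{ii} = 1$ and $\delta_{ij} = 0$, $i \ne j$. Then

(22) $$\begin{aligned} \mathscr{E}(\tilde\beta_{ig} - \hat\beta_{ig})(\hat\beta_{ig} - \beta_{ig}) &= \mathscr{E}\sum_{\alpha,\gamma=1}^N \sum_{h=1}^q z_{h\alpha}a^{hg}u_{i\alpha}\sum_{j=1}^p \Bigl(f_{j\gamma} - \delta_{ij}\sum_{h'=1}^q z_{h'\gamma}a^{h'g}\Bigr)u_{j\gamma} \\ &= \sum_{\alpha=1}^N \sum_{h=1}^q \sum_{j=1}^p z_{h\alpha}a^{hg}\Bigl(f_{j\alpha} - \delta_{ij}\sum_{h'=1}^q z_{h'\alpha}a^{h'g}\Bigr)\sigma_{ij} \\ &= \sigma_{ii}a^{gg} - \sigma_{ii}\sum_{h=1}^q \sum_{h'=1}^q a^{hg}a_{hh'}a^{h'g} \\ &= 0. \end{aligned}$$

Then (20) implies $\mathscr{E}(\tilde\beta_{ig} - \beta_{ig})^2 \ge \mathscr{E}(\hat\beta_{ig} - \beta_{ig})^2$. ■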

8.3. LIKELIHOOD RATIO CRITERIA FOR TESTING LINEAR
HYPOTHESES ABOUT REGRESSION COEFFICIENTS

8.3.1. Likelihood Ratio Criteria
Suppose we partition

(1) $$\beta = (\beta_1 \ \ \beta_2)$$

so that $\beta_1$ has $q_1$ columns and $\beta_2$ has $q_2$ columns. We shall derive the likelihood ratio criterion for testing the hypothesis

(2) $$H: \beta_1 = \beta_1^*,$$

where $\beta_1^*$ is a given matrix. The maximum of the likelihood function $L$ for the sample $x_1, \ldots, x_N$ is

(3) $$\max_{\beta, \Sigma} L = (2\pi)^{-\frac12 pN}|\hat\Sigma_\Omega|^{-\frac12 N}e^{-\frac12 pN},$$

where $\hat\Sigma_\Omega$ is given by (12) or (13) of Section 8.2.


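To find the maximum of the likelihood function for the parameters restricted to $\omega$ defined by (2) we let

(4) $$y_\alpha = x_\alpha - \beta_1^*z_\alpha^{(1)}, \qquad \alpha = 1, \ldots, N,$$

where

(5) $$z_\alpha = \begin{pmatrix} z_\alpha^{(1)} \\ z_\alpha^{(2)} \end{pmatrix}, \qquad \alpha = 1, \ldots, N,$$

is partitioned in a manner corresponding to the partitioning of $\beta$. Then $y_\alpha$ can be considered as an observation from $N(\beta_2 z_\alpha^{(2)}, \Sigma)$. The estimator of $\beta_2$ is obtained by the procedure of Section 8.2 as

(6) $$\hat\beta_{2\omega} = \sum_{\alpha=1}^N y_\alpha z_\alpha^{(2)\prime}A_{22}^{-1} = \sum_{\alpha=1}^N (x_\alpha - \beta_1^*z_\alpha^{(1)})z_\alpha^{(2)\prime}A_{22}^{-1} = (C_2 - \beta_1^*A_{12})A_{22}^{-1}$$

with $C$ and $A$ partitioned in the manner corresponding to the partitioning of $\beta$ and $z_\alpha$,

(7) $$C = (C_1 \ \ C_2),$$

(8) $$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}.$$

The estimator of $\Sigma$ is given by

(9) $$\begin{aligned} N\hat\Sigma_\omega &= \sum_{\alpha=1}^N (y_\alpha - \hat\beta_{2\omega}z_\alpha^{(2)})(y_\alpha - \hat\beta_{2\omega}z_\alpha^{(2)})' \\ &= \sum_{\alpha=1}^N y_\alpha y_\alpha' - \hat\beta_{2\omega}A_{22}\hat\beta_{2\omega}' \\ &= \sum_{\alpha=1}^N (x_\alpha - \beta_1^*z_\alpha^{(1)})(x_\alpha - \beta_1^*z_\alpha^{(1)})' - \hat\beta_{2\omega}A_{22}\hat\beta_{2\omega}'. \end{aligned}$$

Thus the maximum of the likelihood function over $\omega$ is

(10) $$\max_{\beta_2, \Sigma} L = (2\pi)^{-\frac12 pN}|\hat\Sigma_\omega|^{-\frac12 N}e^{-\frac12 pN}.$$

The likelihood ratio criterion for testing $H$ is (10) divided by (3), namely,

(11) $$\lambda = \frac{|\hat\Sigma_\Omega|^{\frac12 N}}{|\hat\Sigma_\omega|^{\frac12 N}}.$$

In testing $H$, one rejects the hypothesis if $\lambda < \lambda_0$, where $\lambda_0$ is a suitably chosen number.

A special case of this problem led to Hotelling's $T^2$-criterion. If $q = q_1 = 1$ ($q_2 = 0$), $z_\alpha = 1$, $\alpha = 1, \ldots, N$, and $\beta = \beta_1 = \mu$, then the $T^2$-criterion for testing the hypothesis $\mu = \mu_0$ is a monotonic function of (11) for $\beta_1^* = \mu_0$.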
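
The monotonic relation with Hotelling's $T^2$ in this special case can be checked numerically. The sketch below assumes the standard identity $\lambda^{2/N} = 1/[1 + T^2/(N - 1)]$ for the one-sample problem; the data, the value of `mu0`, and the helper names are hypothetical.

```python
def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def scatter(rows, center):
    # Sum over observations of (x - center)(x - center)' for p = 2.
    s = [[0.0, 0.0], [0.0, 0.0]]
    for x in rows:
        d = [x[0] - center[0], x[1] - center[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

# Hypothetical sample of N = 6 bivariate observations; hypothesis mu = mu0.
data = [(1.2, 0.7), (0.3, -0.4), (2.1, 1.5), (0.9, 0.2), (1.6, 1.1), (-0.2, 0.4)]
mu0 = (0.0, 0.0)
N = len(data)
xbar = [sum(x[i] for x in data) / N for i in range(2)]

NS_Omega = scatter(data, xbar)   # N times the unrestricted estimator of Sigma
NS_omega = scatter(data, mu0)    # N times the estimator of Sigma under H
lam_2N = det2(NS_Omega) / det2(NS_omega)   # lambda^(2/N)

# Hotelling's T^2 = N (xbar - mu0)' S^{-1} (xbar - mu0) with S = NS_Omega/(N - 1).
S = [[v / (N - 1) for v in row] for row in NS_Omega]
d = [xbar[0] - mu0[0], xbar[1] - mu0[1]]
quad = (d[0] * (S[1][1] * d[0] - S[0][1] * d[1])
        + d[1] * (-S[1][0] * d[0] + S[0][0] * d[1])) / det2(S)
T2 = N * quad

print(lam_2N, 1.0 / (1.0 + T2 / (N - 1)))
```

The two printed values agree, so small values of $\lambda$ correspond to large values of $T^2$.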

The hypothesis $\mu = 0$ and the $T^2$-statistic are invariant with respect to the transformations $X^* = DX$ and $x_\alpha^* = Dx_\alpha$, $\alpha = 1, \ldots, N$, for nonsingular $D$. Similarly, in this problem the null hypothesis $\beta_1 = 0$ and the likelihood ratio criterion for testing it are invariant with respect to nonsingular linear transformations.


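Theorem 8.3.1. The likelihood ratio criterion (11) for testing the null hypothesis $\beta_1 = 0$ is invariant with respect to transformations $x_\alpha^* = Dx_\alpha$, $\alpha = 1, \ldots, N$, for nonsingular $D$.

Proof. The estimators in terms of $x_\alpha^*$ are

(12) $$\hat\beta_\Omega^* = DCA^{-1} = D\hat\beta_\Omega,$$

(13) $$\hat\Sigma_\Omega^* = \frac1N \sum_{\alpha=1}^N (Dx_\alpha - D\hat\beta_\Omega z_\alpha)(Dx_\alpha - D\hat\beta_\Omega z_\alpha)' = D\hat\Sigma_\Omega D',$$

(14) $$\hat\beta_{2\omega}^* = DC_2A_{22}^{-1} = D\hat\beta_{2\omega},$$

(15) $$\hat\Sigma_\omega^* = \frac1N \sum_{\alpha=1}^N (Dx_\alpha - D\hat\beta_{2\omega}z_\alpha^{(2)})(Dx_\alpha - D\hat\beta_{2\omega}z_\alpha^{(2)})' = D\hat\Sigma_\omega D'. \ ■$$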


8.3.2. Geometric Interpretation 

An insight into the algebra developed here can be given in terms of a 
geometric interpretation. It will be convenient to use the following lemma: 

Lemma 8.3.1.

(16) $$\hat\beta_{2\omega} - \hat\beta_{2\Omega} = (\hat\beta_{1\Omega} - \beta_1^*)A_{12}A_{22}^{-1}.$$

Proof. The normal equation $\hat\beta_\Omega A = C$ is written in partitioned form

(17) $$(\hat\beta_{1\Omega}A_{11} + \hat\beta_{2\Omega}A_{21},\ \hat\beta_{1\Omega}A_{12} + \hat\beta_{2\Omega}A_{22}) = (C_1, C_2).$$

Thus $\hat\beta_{2\Omega} = C_2A_{22}^{-1} - \hat\beta_{1\Omega}A_{12}A_{22}^{-1}$. The lemma follows by comparison with (6). ■

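We can now write

(18) $$\begin{aligned} X - \beta Z &= (X - \hat\beta_\Omega Z) + (\hat\beta_{2\Omega} - \beta_2)Z_2 + (\hat\beta_{1\Omega} - \beta_1^*)Z_1 \\ &= (X - \hat\beta_\Omega Z) + (\hat\beta_{2\omega} - \beta_2)Z_2 - (\hat\beta_{2\omega} - \hat\beta_{2\Omega})Z_2 + (\hat\beta_{1\Omega} - \beta_1^*)Z_1 \\ &= (X - \hat\beta_\Omega Z) + (\hat\beta_{2\omega} - \beta_2)Z_2 + (\hat\beta_{1\Omega} - \beta_1^*)(Z_1 - A_{12}A_{22}^{-1}Z_2) \end{aligned}$$

as an identity; here $X = (x_1, \ldots, x_N)$, $Z_1 = (z_1^{(1)}, \ldots, z_N^{(1)})$, and $Z_2 = (z_1^{(2)}, \ldots, z_N^{(2)})$. The rows of $Z = (Z_1', Z_2')'$ span a $q$-dimensional subspace in $N$-space. Each row of $\beta Z$ is a vector in the $q$-space, and hence each row of $X - \beta Z$ is a vector from a vector in the $q$-space to the corresponding row vector of $X$. Each row vector of $X - \beta Z$ is expressed above as the sum of three row vectors. The first matrix on the right of (18) has as its $i$th row a vector orthogonal to the $q$-space and leading to the $i$th row vector of $X$ (as shown in the preceding section). The row vectors of $(\hat\beta_{2\omega} - \beta_2)Z_2$ are vectors in the $q_2$-space spanned by the rows of $Z_2$ (since they are linear combinations of the rows of $Z_2$). The row vectors of $(\hat\beta_{1\Omega} - \beta_1^*)(Z_1 - A_{12}A_{22}^{-1}Z_2)$ are vectors in the $q_1$-space of $Z_1 - A_{12}A_{22}^{-1}Z_2$, and this space is in the $q$-space of $Z$, but orthogonal to the $q_2$-space of $Z_2$ [since $(Z_1 - A_{12}A_{22}^{-1}Z_2)Z_2' = 0$]. Thus each row of $X - \beta Z$ is indicated in Figure 8.1 as the sum of three orthogonal vectors: one vector is in the space orthogonal to $Z$, one is in the space of $Z_2$, and one is in the subspace of $Z$ that is orthogonal to $Z_2$.

From the orthogonality relations we have

(19) $$\begin{aligned} (X - \beta Z)(X - \beta Z)' &= (X - \hat\beta_\Omega Z)(X - \hat\beta_\Omega Z)' + (\hat\beta_{2\omega} - \beta_2)Z_2Z_2'(\hat\beta_{2\omega} - \beta_2)' \\ &\quad + (\hat\beta_{1\Omega} - \beta_1^*)(Z_1 - A_{12}A_{22}^{-1}Z_2)(Z_1 - A_{12}A_{22}^{-1}Z_2)'(\hat\beta_{1\Omega} - \beta_1^*)' \\ &= N\hat\Sigma_\Omega + (\hat\beta_{2\omega} - \beta_2)A_{22}(\hat\beta_{2\omega} - \beta_2)' \\ &\quad + (\hat\beta_{1\Omega} - \beta_1^*)(A_{11} - A_{12}A_{22}^{-1}A_{21})(\hat\beta_{1\Omega} - \beta_1^*)'. \end{aligned}$$

If we subtract $(\hat\beta_{2\omega} - \beta_2)Z_2$ from both sides of (18), we have

(20) $$X - \beta_1^*Z_1 - \hat\beta_{2\omega}Z_2 = (X - \hat\beta_\Omega Z) + (\hat\beta_{1\Omega} - \beta_1^*)(Z_1 - A_{12}A_{22}^{-1}Z_2).$$

From this we obtain

(21) $$\begin{aligned} N\hat\Sigma_\omega &= (X - \beta_1^*Z_1 - \hat\beta_{2\omega}Z_2)(X - \beta_1^*Z_1 - \hat\beta_{2\omega}Z_2)' \\ &= (X - \hat\beta_\Omega Z)(X - \hat\beta_\Omega Z)' \\ &\quad + (\hat\beta_{1\Omega} - \beta_1^*)(Z_1 - A_{12}A_{22}^{-1}Z_2)(Z_1 - A_{12}A_{22}^{-1}Z_2)'(\hat\beta_{1\Omega} - \beta_1^*)' \\ &= N\hat\Sigma_\Omega + (\hat\beta_{1\Omega} - \beta_1^*)(A_{11} - A_{12}A_{22}^{-1}A_{21})(\hat\beta_{1\Omega} - \beta_1^*)'. \end{aligned}$$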
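
The orthogonality that drives these decompositions, $(Z_1 - A_{12}A_{22}^{-1}Z_2)Z_2' = 0$, is easy to verify in the simplest case $q_1 = q_2 = 1$. The design rows below are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical design with q1 = q2 = 1 and N = 5: Z1 and Z2 are single rows.
Z1 = [0.0, 1.0, 2.0, 3.0, 4.0]
Z2 = [1.0, 1.0, 1.0, 1.0, 1.0]

A11, A12, A22 = dot(Z1, Z1), dot(Z1, Z2), dot(Z2, Z2)

# Part of Z1 orthogonal to Z2: Z1 - A12 A22^{-1} Z2.
Z1_perp = [z1 - (A12 / A22) * z2 for z1, z2 in zip(Z1, Z2)]

# A11 - A12 A22^{-1} A21, the squared length of Z1_perp.
A11_2 = A11 - A12 * A12 / A22

print(Z1_perp, dot(Z1_perp, Z2), dot(Z1_perp, Z1_perp), A11_2)
```

Here $Z_2$ is a row of ones, so the orthogonal part of $Z_1$ is simply $Z_1$ with its mean removed, and its squared length equals $A_{11} - A_{12}A_{22}^{-1}A_{21}$.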

The determinant $|\hat\Sigma_\Omega| = (1/N^p)|(X - \hat\beta_\Omega Z)(X - \hat\beta_\Omega Z)'|$ is proportional to the volume squared of the parallelotope spanned by the row vectors of $X - \hat\beta_\Omega Z$ (translated to the origin). The determinant $|\hat\Sigma_\omega| = (1/N^p)|(X - \beta_1^*Z_1 - \hat\beta_{2\omega}Z_2)(X - \beta_1^*Z_1 - \hat\beta_{2\omega}Z_2)'|$ is proportional to the volume squared of the parallelotope spanned by the row vectors of $X - \beta_1^*Z_1 - \hat\beta_{2\omega}Z_2$ (translated to the origin); each of these vectors is the part of the vector of $X - \beta_1^*Z_1$ that is orthogonal to $Z_2$. Thus the test based on the likelihood ratio criterion depends on the ratio of volumes of parallelotopes. One parallelotope involves vectors orthogonal to $Z$, and the other involves vectors orthogonal to $Z_2$.

From (19) we see that the density of $x_1, \ldots, x_N$ can be written as

(22) $$(2\pi)^{-\frac12 pN}|\Sigma|^{-\frac12 N}\exp\bigl(-\tfrac12 \operatorname{tr}\Sigma^{-1}\{N\hat\Sigma_\Omega + (\hat\beta_{2\omega} - \beta_2)A_{22}(\hat\beta_{2\omega} - \beta_2)' + (\hat\beta_{1\Omega} - \beta_1^*)(A_{11} - A_{12}A_{22}^{-1}A_{21})(\hat\beta_{1\Omega} - \beta_1^*)'\}\bigr).$$

Thus $\hat\Sigma_\Omega$, $\hat\beta_{1\Omega}$, and $\hat\beta_{2\omega}$ form a sufficient set of statistics for $\Sigma$, $\beta_1$, and $\beta_2$.

Wilks (1932) first gave the likelihood ratio criterion for testing the equality of mean vectors from several populations (Section 8.8). Wilks (1934) and Bartlett (1934) extended its use to regression coefficients.


8.3.3. The Canonical Form

In studying the distributions of criteria it will be convenient to put the distribution of the observations in canonical form. This amounts to picking a coordinate system in the $N$-dimensional space so that the first $q_1$ coordinate axes are in the space of $Z$ that is orthogonal to $Z_2$, the next $q_2$ coordinate axes are in the space of $Z_2$, and the last $n$ ($= N - q$) coordinate axes are orthogonal to the $Z$-space.

Let P 2 be a X q 2 matrix such that • 

(23) / = P 2 ^ 22 /-=(P 2 Z 2 )(P 2 Z 2 )\ 
and let be a q v X q, matrix such that (A,, 2 = 

(24) I —P[ A.[1.2^1 = [ 尸 - i4 12 4 22 1 Z 2 ) | [/— A ]2 -4 22 ^ 2 )] 1 
Then define the NxN orthogonal matrix Q as 






(P】(Zi ^ ^ 12 ^- 22 ^ 2 ) 

(25) 

0 = 

Q2 

= 






(03 


where Q 3 is any n 乂 N matrix making Q orthogonal. Then the columns of 

(26) W=(W { W 2 W 3 )^XQ r =X(Q[ Q 2 

are independently normally distributed with covariance matrix X (Theorem 
3-3.1). Then 

(27) ^ = ^XQ\ = (PA + P 2 Z 2 )(Z 1 - A l2 A^Z 2 yP\ 

= Pi^n 2^1 = PiA — 1 ， 

(28) P 2 Z 2 )Z^ 

= Ol^I 2 + P 2 ^22)^2 ^ 

(29) ㈣ = SXQ^ = PZ03 = 0. 
Let

(30)  $$\Gamma_1 = (\gamma_1, \dots, \gamma_{q_1}) = \beta_1 A_{11\cdot2}P_1' = \beta_1 P_1^{-1},$$

(31)  $$\Gamma_2 = (\gamma_{q_1+1}, \dots, \gamma_q) = (\beta_1 A_{12} + \beta_2 A_{22})P_2',$$

(32)  $$W = (W_1\ W_2\ W_3) = (w_1, \dots, w_{q_1}, w_{q_1+1}, \dots, w_q, w_{q+1}, \dots, w_N).$$

Then $w_1, \dots, w_N$ are independently normally distributed with covariance
matrix $\Sigma$ and $\mathcal{E}w_\alpha = \gamma_\alpha$, $\alpha = 1, \dots, q$, and $\mathcal{E}w_\alpha = 0$, $\alpha = q+1, \dots, N$.

The hypothesis $\beta_1 = \beta_1^*$ can be transformed to $\beta_1 = 0$ by subtraction, that
is, by letting $x_\alpha - \beta_1^* z_\alpha^{(1)} = y_\alpha$, as in Section 8.3.1. In canonical form then, the
hypothesis is $\Gamma_1 = 0$. We can study problems in the canonical form, if we
wish, and transform solutions back to terms of $X$ and $Z$.

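In (17), which is the partitioned form of $\hat\beta_\Omega A = C$, eliminate $\hat\beta_{2\Omega}$ to obtain

(33)  $$\hat\beta_{1\Omega}(A_{11} - A_{12}A_{22}^{-1}A_{21}) = C_1 - C_2A_{22}^{-1}A_{21} = X(Z_1' - Z_2'A_{22}^{-1}A_{21}) = W_1(P_1')^{-1};$$

that is, $W_1 = \hat\beta_{1\Omega}A_{11\cdot2}P_1' = \hat\beta_{1\Omega}P_1^{-1}$ and $\Gamma_1 = \beta_1 P_1^{-1}$. Similarly, from (6) we
obtain

(34)  $$\hat\beta_{2\omega}A_{22} + \beta_1^*A_{12} = C_2 = XZ_2' = W_2(P_2')^{-1};$$

that is, $W_2 = (\hat\beta_{2\omega}A_{22} + \beta_1^*A_{12})P_2' = \hat\beta_{2\omega}P_2^{-1} + \beta_1^*A_{12}P_2'$ and $\Gamma_2 = \beta_2P_2^{-1} + \beta_1A_{12}P_2'$.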


8.4. THE DISTRIBUTION OF THE LIKELIHOOD RATIO CRITERION
WHEN THE HYPOTHESIS IS TRUE


8.4.1. Characterization of the Distribution

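The likelihood ratio criterion is the $\frac12 N$th power of

(1)  $$U = \lambda^{2/N} = \frac{|\hat\Sigma_\Omega|}{|\hat\Sigma_\omega|} = \frac{|N\hat\Sigma_\Omega|}{|N\hat\Sigma_\Omega + (\hat\beta_{1\Omega}-\beta_1^*)A_{11\cdot2}(\hat\beta_{1\Omega}-\beta_1^*)'|},$$

where $A_{11\cdot2} = A_{11} - A_{12}A_{22}^{-1}A_{21}$. We shall study the distribution and the
moments of $U$ when $\beta_1 = \beta_1^*$. It has been shown in Section 8.2 that $N\hat\Sigma_\Omega$ is
distributed according to $W(\Sigma, n)$, where $n = N - q$, and the elements of
$\hat\beta_\Omega - \beta$ have a joint normal distribution independent of $N\hat\Sigma_\Omega$.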
From (33) of Section 8.3, we have

(2)  $$(\hat\beta_{1\Omega}-\beta_1^*)A_{11\cdot2}(\hat\beta_{1\Omega}-\beta_1^*)' = (W_1-\Gamma_1)P_1A_{11\cdot2}P_1'(W_1-\Gamma_1)' = (W_1-\Gamma_1)(W_1-\Gamma_1)'$$

by (24) of Section 8.3; the columns of $W_1 - \Gamma_1$ are independently distributed,
each according to $N(0, \Sigma)$.

Lemma 8.4.1. $(\hat\beta_{1\Omega}-\beta_1^*)A_{11\cdot2}(\hat\beta_{1\Omega}-\beta_1^*)'$ is distributed according to
$W(\Sigma, q_1)$.

Lemma 8.4.2. The criterion $U$ has the distribution of

(3)  $$U = \frac{|G|}{|G+H|},$$

where $G$ is distributed according to $W(\Sigma, n)$, $H$ is distributed according to
$W(\Sigma, m)$, where $m = q_1$, and $G$ and $H$ are independent.

Let

(4)  $$G = N\hat\Sigma_\Omega = XX' - XZ'(ZZ')^{-1}ZX',$$

(5)  $$G + H = N\hat\Sigma_\Omega + (\hat\beta_{1\Omega}-\beta_1^*)A_{11\cdot2}(\hat\beta_{1\Omega}-\beta_1^*)' = N\hat\Sigma_\omega = YY' - YZ_2'(Z_2Z_2')^{-1}Z_2Y',$$

where $Y = X - \beta_1^*Z_1 = X - (\beta_1^*\ 0)Z$. Then

(6)  $$G = YY' - YZ'(ZZ')^{-1}ZY'.$$

We shall denote this criterion as $U_{p,m,n}$, where $p$ is the dimensionality,
$m = q_1$ is the number of columns of $\beta_1$, and $n = N - q$ is the number of
degrees of freedom of $G$.

We now proceed to characterize the distribution of $U$ as the product of
beta variables (Section 5.2). Write the criterion $U$ as

(7)  $$U = V_1 V_2 \cdots V_p,$$

where $V_1 = g_{11}/(g_{11} + h_{11})$,

(8)  $$V_i = \frac{|G_i|}{|G_{i-1}|}\cdot\frac{|G_{i-1}+H_{i-1}|}{|G_i+H_i|}, \qquad i = 2, \dots, p,$$

and $G_i$ and $H_i$ are the submatrices of $G$ and $H$, respectively, of the first $i$
rows and columns. Correspondingly, let $y_\alpha^{(i)}$ consist of the first $i$ components
of $y_\alpha = x_\alpha - \beta_1^* z_\alpha^{(1)}$, $\alpha = 1, \dots, N$. We shall show that $V_i$ is the length squared
of the vector from $y_i^* = (y_{i1}, \dots, y_{iN})$ to its projection on $Z$ and $Y_{i-1} =
(y_1^{(i-1)}, \dots, y_N^{(i-1)})$ divided by the length squared of the vector from $y_i^*$ to its
projection on $Z_2$ and $Y_{i-1}$.

Lemma 8.4.3. Let $y$ be an $N$-component row vector and $U$ an $r \times N$ matrix.
Then the sum of squares of the residuals of $y$ from its regression on $U$ is

(9)  $$\frac{\begin{vmatrix} yy' & yU' \\ Uy' & UU' \end{vmatrix}}{|UU'|}.$$

Proof. By Corollary A.3.1 of the Appendix, (9) is $yy' - yU'(UU')^{-1}Uy'$,
which is the sum of squares of residuals as indicated in (13) of Section 8.2.

■

Lemma 8.4.4. $V_i$ defined by (8) is the ratio of the sum of squares of the
residuals of $y_{i1}, \dots, y_{iN}$ from their regression on $y_1^{(i-1)}, \dots, y_N^{(i-1)}$ and $Z$ to the
sum of squares of residuals of $y_{i1}, \dots, y_{iN}$ from their regression on $y_1^{(i-1)}, \dots, y_N^{(i-1)}$
and $Z_2$.

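Proof. The numerator of $V_i$ can be written [from (13) of Section 8.2]

(10)  $$\begin{aligned}
\frac{|G_i|}{|G_{i-1}|}
&= \frac{|Y_iY_i' - Y_iZ'(ZZ')^{-1}ZY_i'|}{|Y_{i-1}Y_{i-1}' - Y_{i-1}Z'(ZZ')^{-1}ZY_{i-1}'|}
 = \frac{\begin{vmatrix} Y_iY_i' & Y_iZ' \\ ZY_i' & ZZ' \end{vmatrix} \Big/ |ZZ'|}{\begin{vmatrix} Y_{i-1}Y_{i-1}' & Y_{i-1}Z' \\ ZY_{i-1}' & ZZ' \end{vmatrix} \Big/ |ZZ'|} \\[4pt]
&= \frac{\begin{vmatrix} y_i^*y_i^{*\prime} & y_i^*Y_{i-1}' & y_i^*Z' \\ Y_{i-1}y_i^{*\prime} & Y_{i-1}Y_{i-1}' & Y_{i-1}Z' \\ Zy_i^{*\prime} & ZY_{i-1}' & ZZ' \end{vmatrix}}{\begin{vmatrix} Y_{i-1}Y_{i-1}' & Y_{i-1}Z' \\ ZY_{i-1}' & ZZ' \end{vmatrix}} \\[4pt]
&= y_i^*y_i^{*\prime} - y_i^*\,(Y_{i-1}'\ Z')\left[\begin{pmatrix} Y_{i-1} \\ Z \end{pmatrix}(Y_{i-1}'\ Z')\right]^{-1}\begin{pmatrix} Y_{i-1} \\ Z \end{pmatrix} y_i^{*\prime}
\end{aligned}$$

by Corollary A.3.1. Application of Lemma 8.4.3 shows that the right-hand
side of (10) is the sum of squares of the residuals of $y_i^*$ on $Y_{i-1}$ and $Z$. The
denominator is evaluated similarly with $Z$ replaced by $Z_2$. ■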

The ratio $V_i$ is the $2/N$th power of the likelihood ratio criterion for
testing the hypothesis that the regression of $y_i^* = x_i^* - \beta_{i1}^*Z_1$ on $Z_1$ is 0 (in
the presence of regression on $Y_{i-1}$ and $Z_2$); here $\beta_{i1}^*$ is the $i$th row of $\beta_1^*$. For
$i = 1$, $g_{11}$ is the sum of squares of the residuals of $y_1^* = (y_{11}, \dots, y_{1N})$ from its
regression on $Z$, and $g_{11} + h_{11}$ is the sum of squares of the residuals from $Z_2$.
The ratio $V_1 = g_{11}/(g_{11} + h_{11})$, which is appropriate to test the hypothesis
that regression of $y_1^*$ on $Z_1$ is 0, is distributed as $\chi_n^2/(\chi_n^2 + \chi_m^2)$ (by Lemma
8.4.2) and has the beta distribution $\beta(v; \frac12 n, \frac12 m)$. (See Section 5.2, for
example.) Thus $V_i$ has the beta density

(11)  $$\beta[v; \tfrac12(n+1-i), \tfrac12 m] = \frac{\Gamma[\frac12(n+m+1-i)]}{\Gamma[\frac12(n+1-i)]\,\Gamma(\frac12 m)}\, v^{\frac12(n+1-i)-1}(1-v)^{\frac12 m-1}$$

for $0 < v < 1$ and 0 for $v$ outside this interval. Since this distribution does not
depend on $Y_{i-1}$, we see that the ratio $V_i$ is independent of $Y_{i-1}$, and hence
independent of $V_1, \dots, V_{i-1}$. Then $V_1, \dots, V_p$ are independent.

Theorem 8.4.1. The distribution of $U$ defined by (3) is the distribution of the
product $\prod_{i=1}^p V_i$, where $V_1, \dots, V_p$ are independent and $V_i$ has the density (11).

The cdf of $U$ can be found by integrating the joint density of $V_1, \dots, V_p$
over the range

(12)  $$\prod_{i=1}^p V_i \le u.$$
(12) 
We shall now show that for given $N - q_2$ the indices $p$ and $q_1$ can be
interchanged; that is, the distributions of $U_{p,q_1,N-q_2-q_1} = U_{p,m,n}$ and of
$U_{q_1,p,N-q_2-p} = U_{m,p,n+m-p}$ are the same. The joint density of $G$ and $W_1$
defined in Section 8.2 when $\Sigma = I$ and $\beta_1 = 0$ is

(13)  $$\frac{|G|^{\frac12(n-p-1)}e^{-\frac12\operatorname{tr}G}}{2^{\frac12 pn}\pi^{p(p-1)/4}\prod_{i=1}^p\Gamma[\frac12(n+1-i)]}\cdot\frac{e^{-\frac12\operatorname{tr}W_1W_1'}}{(2\pi)^{\frac12 mp}}.$$


Let $G + W_1W_1' = J = CC'$ and let $W_1 = CU$. Then

(14)  $$U_{p,m,n} = \frac{|G|}{|G+W_1W_1'|} = \frac{|CC'-CUU'C'|}{|CC'|} = |I_p - UU'| = \begin{vmatrix} I_p & U \\ U' & I_m \end{vmatrix} = \begin{vmatrix} I_m & U' \\ U & I_p \end{vmatrix} = |I_m - U'U|;$$

the fourth and sixth equalities follow from Theorem A.3.2 of the Appendix,
and the fifth from permutation of rows and columns. Since the Jacobian of
$W_1 = CU$ is $\operatorname{mod}|C|^m = |J|^{\frac12 m}$, the joint density of $J$ and $U$ is

(15)  $$\frac{|J|^{\frac12(n+m-p-1)}e^{-\frac12\operatorname{tr}J}}{2^{\frac12 p(n+m)}\pi^{p(p-1)/4}\prod_{i=1}^p\Gamma[\frac12(n+m+1-i)]}\cdot\prod_{i=1}^p\frac{\Gamma[\frac12(n+m+1-i)]}{\Gamma[\frac12(n+1-i)]}\cdot\frac{|I_p-UU'|^{\frac12(n-p-1)}}{\pi^{\frac12 mp}}$$

for $J$ and $I_p - UU'$ positive definite, and 0 otherwise. Thus $J$ and $U$ are
independently distributed; the density of $J$ is the first term in (15), namely,
$w(J \mid I_p, n+m)$, and the density of $U$ is the second term, namely, of the form

(16)  $$K|I_p - UU'|^{\frac12(n-p-1)}$$

for $I_p - UU'$ positive definite, and 0 otherwise. Let $U_* = U'$, $p^* = m$, $m^* = p$,
and $n^* = n + m - p$. Then the density of $U_*$ is

(17)  $$K|I_p - U_*'U_*|^{\frac12(n-p-1)}$$

for $I_p - U_*'U_*$ positive definite, and 0 otherwise. By (14), $|I_p - U_*'U_*| =
|I_m - U_*U_*'|$, and hence the density of $U_*$ is

(18)  $$K|I_{p^*} - U_*U_*'|^{\frac12(n^*-p^*-1)},$$

which is of the form of (16) with $p$ replaced by $p^* = m$, $m$ replaced by
$m^* = p$, and $n - p - 1$ replaced by $n^* - p^* - 1 = n - p - 1$. Finally we note
that $U_{p,m,n}$ given by (14) is $|I_m - U_*U_*'| = U_{m,p,n+m-p}$.

Theorem 8.4.2. When the hypothesis is true, the distribution of $U_{p,q_1,N-q_1-q_2}$
is the same as that of $U_{q_1,p,N-p-q_2}$ (i.e., that of $U_{p,m,n}$ is that of $U_{m,p,n+m-p}$).


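8.4.2. Moments

Since (11) is a density and hence integrates to 1, by change of notation

(19)  $$\int_0^1 u^{a-1}(1-u)^{b-1}\,du = B(a,b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}.$$

From this fact we see that the $h$th moment of $V_i$ is

(20)  $$\mathcal{E}V_i^h = \int_0^1 \frac{\Gamma[\frac12(n+m+1-i)]}{\Gamma[\frac12(n+1-i)]\Gamma(\frac12 m)}\, v^{\frac12(n+1-i)+h-1}(1-v)^{\frac12 m-1}\,dv = \frac{\Gamma[\frac12(n+1-i)+h]\,\Gamma[\frac12(n+m+1-i)]}{\Gamma[\frac12(n+1-i)]\,\Gamma[\frac12(n+m+1-i)+h]}.$$

Since $V_1, \dots, V_p$ are independent, $\mathcal{E}U^h = \mathcal{E}\prod_{i=1}^p V_i^h = \prod_{i=1}^p \mathcal{E}V_i^h$. We obtain
the following theorem:

Theorem 8.4.3. The $h$th moment of $U$ [if $h > -\frac12(n+1-p)$] is

(21)  $$\mathcal{E}U^h = \prod_{i=1}^p \frac{\Gamma[\frac12(n+1-i)+h]\,\Gamma[\frac12(n+m+1-i)]}{\Gamma[\frac12(n+1-i)]\,\Gamma[\frac12(n+m+1-i)+h]} = \prod_{i=1}^p \frac{\Gamma[\frac12(N-q+1-i)+h]\,\Gamma[\frac12(N-q_2+1-i)]}{\Gamma[\frac12(N-q_1-q_2+1-i)]\,\Gamma[\frac12(N-q_2+1-i)+h]}.$$

In the first expression $p$ can be replaced by $m$, $m$ by $p$, and $n$ by
$n + m - p$.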
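The moment formula of Theorem 8.4.3 is easy to evaluate on the log scale. The following is a minimal sketch, not part of the original text; the function names `log_moment_U` and `moment_U` are illustrative choices. It uses only the gamma-function product in (21) and can be used to check the interchange of indices stated in Theorem 8.4.2.

```python
import math


def log_moment_U(p, m, n, h):
    """Return log E[U_{p,m,n}^h] from the gamma-function product (21)."""
    if h <= -0.5 * (n + 1 - p):
        raise ValueError("the moment exists only for h > -(n + 1 - p)/2")
    total = 0.0
    for i in range(1, p + 1):
        a = 0.5 * (n + 1 - i)        # first beta parameter of V_i
        ab = 0.5 * (n + m + 1 - i)   # sum of the two beta parameters of V_i
        total += (math.lgamma(a + h) + math.lgamma(ab)
                  - math.lgamma(a) - math.lgamma(ab + h))
    return total


def moment_U(p, m, n, h):
    """E[U_{p,m,n}^h], computed via logarithms for numerical stability."""
    return math.exp(log_moment_U(p, m, n, h))


if __name__ == "__main__":
    # Interchanging (p, m, n) with (m, p, n + m - p) leaves the moments unchanged.
    print(moment_U(3, 2, 10, 1.0), moment_U(2, 3, 9, 1.0))
```

For $p = 1$ the first moment reduces to the mean $n/(n+m)$ of a beta variable with parameters $\frac12 n$ and $\frac12 m$, which gives a quick sanity check.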

Suppose $p$ is even, that is, $p = 2r$. We use the duplication formula

(22)  $$\Gamma(\alpha + \tfrac12)\,\Gamma(\alpha + 1) = \frac{\sqrt{\pi}\,\Gamma(2\alpha+1)}{2^{2\alpha}}.$$

Then the $h$th moment of $U_{2r,m,n}$ is

(23)  $$\begin{aligned}
\mathcal{E}U_{2r,m,n}^h
&= \prod_{j=1}^r \left\{ \frac{\Gamma[\frac12(m+n+2)-j]\,\Gamma[\frac12(m+n+1)-j]}{\Gamma[\frac12(m+n+2)-j+h]\,\Gamma[\frac12(m+n+1)-j+h]} \cdot \frac{\Gamma[\frac12(n+2)-j+h]\,\Gamma[\frac12(n+1)-j+h]}{\Gamma[\frac12(n+2)-j]\,\Gamma[\frac12(n+1)-j]} \right\} \\
&= \prod_{j=1}^r \left\{ \frac{\Gamma(m+n+1-2j)\,\Gamma(n+1-2j+2h)}{\Gamma(m+n+1-2j+2h)\,\Gamma(n+1-2j)} \right\}.
\end{aligned}$$

It is clear from the definition of the beta function that (23) is

(24)  $$\prod_{j=1}^r \left\{ \int_0^1 \frac{\Gamma(m+n+1-2j)}{\Gamma(n+1-2j)\,\Gamma(m)}\, y^{(n+1-2j)+2h-1}(1-y)^{m-1}\,dy \right\} = \mathcal{E}\prod_{j=1}^r \left(Y_j^2\right)^h,$$

where the $Y_j$ are independent and $Y_j$ has density $\beta(y; n+1-2j, m)$.
Suppose $p$ is odd; that is, $p = 2s + 1$. Then

(25)  $$\mathcal{E}U_{2s+1,m,n}^h = \mathcal{E}\left(\prod_{i=1}^s Z_i^2 \cdot Z_{s+1}\right)^h,$$

where the $Z_i$ are independent and $Z_i$ has density $\beta(z; n+1-2i, m)$ for
$i = 1, \dots, s$ and $Z_{s+1}$ is distributed with density $\beta[z; (n+1-p)/2, m/2]$.

Theorem 8.4.4. $U_{2r,m,n}$ is distributed as $\prod_{i=1}^r Y_i^2$, where $Y_1, \dots, Y_r$ are
independent and $Y_i$ has density $\beta(y; n+1-2i, m)$; $U_{2s+1,m,n}$ is distributed as
$\prod_{i=1}^s Z_i^2\, Z_{s+1}$, where the $Z_i$, $i = 1, \dots, s$, are independent and $Z_i$ has density
$\beta(z; n+1-2i, m)$, and $Z_{s+1}$ is independently distributed with density
$\beta[z; \frac12(n+1-p), \frac12 m]$.
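The representation in Theorem 8.4.4 can be checked numerically through moments. The sketch below is an illustration under stated assumptions (helper names are hypothetical, and only the case $p = 2r$ is covered): it compares the product formula (21) with the product of the $2h$th moments of beta variables with parameters $n+1-2j$ and $m$, as in (23).

```python
import math


def moment_general(p, m, n, h):
    """E[U_{p,m,n}^h] from the product over i = 1, ..., p of beta moments (21)."""
    log_total = 0.0
    for i in range(1, p + 1):
        a = 0.5 * (n + 1 - i)
        ab = 0.5 * (n + m + 1 - i)
        log_total += (math.lgamma(a + h) + math.lgamma(ab)
                      - math.lgamma(a) - math.lgamma(ab + h))
    return math.exp(log_total)


def moment_even_p(r, m, n, h):
    """E[prod_j Y_j^(2h)] with Y_j ~ Beta(n + 1 - 2j, m), j = 1, ..., r, as in (23)."""
    log_total = 0.0
    for j in range(1, r + 1):
        a = n + 1 - 2 * j
        log_total += (math.lgamma(a + 2 * h) + math.lgamma(a + m)
                      - math.lgamma(a) - math.lgamma(a + m + 2 * h))
    return math.exp(log_total)


if __name__ == "__main__":
    # The two expressions should agree for p = 2r.
    print(moment_general(4, 3, 10, 0.7), moment_even_p(2, 3, 10, 0.7))
```

Agreement of the two functions for several values of $h$ is consistent with the theorem; it is a numerical check, not a proof.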


8.4.3. Some Special Distributions

p = 1

From the preceding characterization we see that the density of $U_{1,m,n}$ is

(26)  $$\frac{\Gamma[\frac12(n+m)]}{\Gamma(\frac12 n)\,\Gamma(\frac12 m)}\, u^{\frac12 n-1}(1-u)^{\frac12 m-1}.$$

Another way of writing $U_{1,m,n}$ is

(27)  $$U_{1,m,n} = \frac{1}{1 + \sum_{i=1}^m Y_i^2/g_{11}} = \frac{1}{1 + (m/n)F_{m,n}},$$

where $g_{11}$ is the one element of $G = N\hat\Sigma_\Omega$ and $F_{m,n}$ is an $F$-statistic. Thus

(28)  $$\frac{1 - U_{1,m,n}}{U_{1,m,n}}\cdot\frac{n}{m} = F_{m,n}.$$

Theorem 8.4.5. The distribution of $[(1 - U_{1,m,n})/U_{1,m,n}]\cdot n/m$ is the
$F$-distribution with $m$ and $n$ degrees of freedom; the distribution of
$[(1 - U_{p,1,n})/U_{p,1,n}]\cdot(n+1-p)/p$ is the $F$-distribution with $p$ and $n+1-p$
degrees of freedom.


p = 2

From Theorem 8.4.4, we see that the density of $\sqrt{U_{2,m,n}}$ is

(29)  $$\frac{\Gamma(n+m-1)}{\Gamma(n-1)\,\Gamma(m)}\, u^{n-2}(1-u)^{m-1},$$

and thus the density of $U_{2,m,n}$ is

(30)  $$\frac{\Gamma(n+m-1)}{2\,\Gamma(n-1)\,\Gamma(m)}\, u^{\frac12(n-3)}(1-\sqrt{u})^{m-1}.$$

From (29) it follows that

(31)  $$\frac{1 - \sqrt{U_{2,m,n}}}{\sqrt{U_{2,m,n}}}\cdot\frac{n-1}{m} = F_{2m,2(n-1)}.$$

Theorem 8.4.6. The distribution of $[(1 - \sqrt{U_{2,m,n}})/\sqrt{U_{2,m,n}}]\cdot(n-1)/m$
is the $F$-distribution with $2m$ and $2(n-1)$ degrees of freedom; the distribution
of $[(1 - \sqrt{U_{p,2,n}})/\sqrt{U_{p,2,n}}]\cdot(n+1-p)/p$ is the $F$-distribution with $2p$ and
$2(n+1-p)$ degrees of freedom.
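The four exact cases of Theorems 8.4.5 and 8.4.6 can be collected in one helper. This is a sketch for illustration only; the function name `exact_f` and its return convention (statistic, numerator degrees of freedom, denominator degrees of freedom) are assumptions, not notation from the text.

```python
import math


def exact_f(U, p, m, n):
    """Exact F transform of U_{p,m,n} when p or m is 1 or 2.

    Returns (F, df1, df2); under the null hypothesis F has the
    F-distribution with df1 and df2 degrees of freedom.
    """
    if not 0.0 < U <= 1.0:
        raise ValueError("U must lie in (0, 1]")
    if p == 1:                      # Theorem 8.4.5, first part
        return (1.0 - U) / U * n / m, m, n
    if m == 1:                      # Theorem 8.4.5, second part
        return (1.0 - U) / U * (n + 1 - p) / p, p, n + 1 - p
    root = math.sqrt(U)
    if p == 2:                      # Theorem 8.4.6, first part
        return (1.0 - root) / root * (n - 1) / m, 2 * m, 2 * (n - 1)
    if m == 2:                      # Theorem 8.4.6, second part
        return (1.0 - root) / root * (n + 1 - p) / p, 2 * p, 2 * (n + 1 - p)
    raise ValueError("no exact F transform for this (p, m) in Theorems 8.4.5-8.4.6")


if __name__ == "__main__":
    print(exact_f(0.25, 2, 3, 10))
```

The resulting statistic would then be referred to an $F$ table or to an $F$ cdf routine of the reader's choice.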


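p Even

Wald and Brookner (1941) gave a method for finding the distribution of
$U_{p,m,n}$ for $p$ or $m$ even. We shall present the method of Schatzoff (1966a). It
will be convenient first to consider $U_{p,m,n}$ for $m = 2r$. We can write the event
$\prod_{i=1}^p V_i \le u$ as

(32)  $$Y_1 + \cdots + Y_p \ge -\log u,$$

where $Y_1, \dots, Y_p$ are independent and $Y_i = -\log V_i$ has the density

(33)  $$K_i\, e^{-\frac12(n+1-i)y}\left(1 - e^{-y}\right)^{r-1} = K_i \sum_{j=0}^{r-1} (-1)^j \binom{r-1}{j} e^{-[\frac12(n+1-i)+j]y}$$

for $0 \le y < \infty$ and 0 otherwise, and

(34)  $$K_i = \frac{\Gamma[\frac12(n+1-i)+r]}{\Gamma[\frac12(n+1-i)]\,\Gamma(r)} = \frac{1}{(r-1)!}\prod_{j=0}^{r-1}\left[\tfrac12(n+1-i)+j\right].$$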

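The joint density of $Y_1, \dots, Y_p$ is then a linear combination of terms
$\exp[-\sum_{i=1}^p a_i y_i]$. The density of $W_j = \sum_{i=1}^j Y_i$ can be obtained inductively from
the density of $W_{j-1} = \sum_{i=1}^{j-1} Y_i$ and $Y_j$, $j = 2, \dots, p$, which is a linear
combination of terms $w_{j-1}^k e^{c w_{j-1} + a_j y_j}$. The density of $W_j$ consists of linear
combinations of

(35)  $$\int_0^{w_j} w^k e^{cw + a_j(w_j - w)}\,dw = e^{a_j w_j}\int_0^{w_j} w^k e^{(c-a_j)w}\,dw
= \begin{cases}
e^{a_j w_j}\,\dfrac{w_j^{k+1}}{k+1} & \text{if } a_j = c, \\[8pt]
e^{c w_j}\displaystyle\sum_{h=0}^{k} (-1)^h \frac{k!}{(k-h)!}\,\frac{w_j^{k-h}}{(c-a_j)^{h+1}} + (-1)^{k+1}\,\frac{k!\,e^{a_j w_j}}{(c-a_j)^{k+1}} & \text{if } a_j \ne c.
\end{cases}$$

The evaluation involves integration by parts.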

Theorem 8.4.7. If $p$ is even or if $m$ is even, the density of $U_{p,m,n}$ can be
expressed as a linear combination of terms $(-\log u)^k u^l$, where $k$ is an integer and
$l$ is a half integer.

From (35) we see that the cumulative distribution function of $-\log U$ is a
linear combination of terms $w^k e^{-lw}$ and hence the cumulative distribution
function of $U$ is a linear combination of terms $(-\log u)^k u^l$. The values of $k$
and $l$ and the coefficients depend on $p$, $m$, and $n$. They can be obtained by
inductively carrying out the procedure leading to Theorem 8.4.7. Pillai and
Gupta (1969) used Theorem 8.4.3 for obtaining distributions.
An alternative approach is to use Theorem 8.4.4. The complement to the
cumulative distribution function of $U_{2r,m,n}$ is

(36)  $$\Pr\{U_{2r,m,n} \ge u\} = \Pr\Bigl\{\prod_{i=1}^r Y_i \ge \sqrt{u}\Bigr\} = \int_{\sqrt{u}}^1 \int_{\sqrt{u}/y_1}^1 \cdots \int_{\sqrt{u}/\prod_{i=1}^{r-1} y_i}^1 \prod_{i=1}^r \beta(y_i; n+1-2i, m)\,dy_r \cdots dy_1.$$

In the density, $(1 - y_i)^{m-1}$ can be expanded by the binomial theorem. Then
all integrations are expressed as integrations of powers of the variables.

As an example, consider $r = 2$. The density of $Y_1$ and $Y_2$ is

(37)  $$c\,y_1^{n-2}y_2^{n-4}(1-y_1)^{m-1}(1-y_2)^{m-1} = c\sum_{i,j=0}^{m-1} \frac{[(m-1)!]^2(-1)^{i+j}}{(m-i-1)!\,(m-j-1)!\,i!\,j!}\, y_1^{n-2+i}\,y_2^{n-4+j},$$

where

(38)  $$c = \frac{\Gamma(n+m-1)\,\Gamma(n+m-3)}{\Gamma(n-1)\,\Gamma(n-3)\,\Gamma^2(m)}.$$

The complement to the cdf of $U_{4,m,n}$ is

(39)  $$\begin{aligned}
\Pr\{U_{4,m,n} \ge u\}
&= c\sum_{i,j=0}^{m-1} \frac{[(m-1)!]^2(-1)^{i+j}}{(m-i-1)!\,(m-j-1)!\,i!\,j!} \int_{\sqrt{u}}^1 \int_{\sqrt{u}/y_1}^1 y_1^{n-2+i}\,y_2^{n-4+j}\,dy_2\,dy_1 \\
&= c\sum_{i,j=0}^{m-1} \frac{[(m-1)!]^2(-1)^{i+j}}{(m-i-1)!\,(m-j-1)!\,i!\,j!\,(n-3+j)} \int_{\sqrt{u}}^1 \left[y_1^{n-2+i} - u^{\frac12(n-3+j)}\,y_1^{1+i-j}\right] dy_1.
\end{aligned}$$

The last step of the integration yields powers of $\sqrt{u}$ and products of powers
of $\sqrt{u}$ and $\log u$ (for $1 + i - j = -1$).





Particular Values

Wilks (1935) gives explicitly the distributions of $U$ for $p = 1$, $p = 2$, $p = 3$
with $m = 3$; $p = 3$ with $m = 4$; and $p = 4$ with $m = 4$. Wilks's formula for
$p = 3$ with $m = 4$ appears to be incorrect; see the first edition of this book.
Consul (1966) gives many distributions for special cases. See also Mathai
(1971).


8.4.4. The Likelihood Ratio Procedure 

Let $u_{p,m,n}(\alpha)$ be the $\alpha$ significance point for $U_{p,m,n}$; that is,

(40)  $$\Pr\{U_{p,m,n} \le u_{p,m,n}(\alpha) \mid H \text{ true}\} = \alpha.$$

It is shown in Section 8.5 that $-[n - \frac12(p-m+1)]\log U_{p,m,n}$ has a limiting
$\chi^2$-distribution with $pm$ degrees of freedom. Let $\chi^2_{pm}(\alpha)$ denote the $\alpha$
significance point of $\chi^2_{pm}$, and let

(41)  $$C_{p,m,n-p+1}(\alpha) = \frac{-[n - \frac12(p-m+1)]\log u_{p,m,n}(\alpha)}{\chi^2_{pm}(\alpha)}.$$

Table B.1 [from Pearson and Hartley (1972)] gives values of $C_{p,m,M}(\alpha)$ for
$\alpha = 0.10$ and $0.05$, $p = 1(1)10$, various even values of $m$, and $M = n - p + 1
= 1(1)10(2)20, 24, 30, 40, 60, 120$.

To test a null hypothesis one computes $U_{p,m,n}$ and rejects the null
hypothesis at significance level $\alpha$ if

(42)  $$-[n - \tfrac12(p-m+1)]\log U_{p,m,n} > C_{p,m,n-p+1}(\alpha)\,\chi^2_{pm}(\alpha).$$

Since $C_{p,m,n}(\alpha) > 1$, the hypothesis is accepted if the left-hand side of (42) is
less than $\chi^2_{pm}(\alpha)$.
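The decision rule (42) can be sketched as follows. This is an illustrative helper, not the book's procedure verbatim: the names `lr_statistic` and `lr_test` are hypothetical, and both the chi-square significance point and the correction factor $C$ must be supplied by the user from tables (the default `C=1.0` corresponds to using the uncorrected chi-square point).

```python
import math


def lr_statistic(U, p, m, n):
    """Left-hand side of (42): -[n - (p - m + 1)/2] log U_{p,m,n}."""
    return -(n - 0.5 * (p - m + 1)) * math.log(U)


def lr_test(U, p, m, n, chi2_point, C=1.0):
    """Return (statistic, reject) using the rule statistic > C * chi2_point.

    chi2_point is the alpha significance point of chi-square with p*m
    degrees of freedom and C is the tabulated factor C_{p,m,n-p+1}(alpha);
    both are inputs taken from tables, not computed here.
    """
    stat = lr_statistic(U, p, m, n)
    return stat, stat > C * chi2_point


if __name__ == "__main__":
    # Hypothetical inputs: U = 0.2 with p = 3, m = 6, n = 18, chi-square point 26.0.
    print(lr_test(0.2, 3, 6, 18, chi2_point=26.0))
```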

The purpose of tabulating $C_{p,m,M}(\alpha)$ is that linear interpolation is reasonably
accurate because the entries decrease monotonically and smoothly to 1
as $M$ increases. Schatzoff (1966a) has recommended interpolation for odd $p$
by using adjacent even values of $p$ and displays some examples. The table
also indicates how accurate the $\chi^2$-approximation is. The table has been
extended by Pillai and Gupta (1969).


8.4.5. A Step-down Procedure 

The criterion $U$ has been expressed in (7) as the product of independent beta
variables $V_1, V_2, \dots, V_p$. The ratio $V_i$ is a least squares criterion for testing the
null hypothesis that in the regression of $x_i^* - \beta_{i1}^*Z_1$ on $Z = (Z_1'\ Z_2')'$ and
$X_{i-1}$ the coefficient of $Z_1$ is 0. The null hypothesis that the regression of $X$
on $Z_1$ is $\beta_1^*$, which is equivalent to the hypothesis that the regression of
$X - \beta_1^*Z_1$ on $Z_1$ is 0, is composed of the hypotheses that the regression
of $x_i^* - \beta_{i1}^*Z_1$ on $Z_1$ is 0, $i = 1, \dots, p$. Hence the null hypothesis $\beta_1 = \beta_1^*$ can
be tested by use of $V_1, \dots, V_p$.

Since $V_i$ has the beta density (11) under the hypothesis $\beta_{i1} = \beta_{i1}^*$,

(43)  $$\frac{1 - V_i}{V_i}\cdot\frac{n-i+1}{m}$$

has the $F$-distribution with $m$ and $n - i + 1$ degrees of freedom. The step-down
testing procedure is to compare (43) for $i = 1$ with the significance
point $F_{m,n}(\varepsilon_1)$; if (43) for $i = 1$ is larger, reject the null hypothesis that the
regression of $x_1^* - \beta_{11}^*Z_1$ on $Z_1$ is 0 and hence reject the null hypothesis that
$\beta_1 = \beta_1^*$. If this first component null hypothesis is accepted, compare (43) for
$i = 2$ with $F_{m,n-1}(\varepsilon_2)$. In sequence, the component null hypotheses are
tested. If one is rejected, the sequence is stopped and the hypothesis $\beta_1 = \beta_1^*$
is rejected. If all component null hypotheses are accepted, the composite
hypothesis is accepted. When the hypothesis $\beta_1 = \beta_1^*$ is true, the probability
of accepting it is $\prod_{i=1}^p (1 - \varepsilon_i)$. Hence the significance level of the step-down
test is $1 - \prod_{i=1}^p (1 - \varepsilon_i)$.
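The bookkeeping of the step-down procedure can be sketched in a few lines. The function names below are hypothetical, and the $F$ significance points are assumed to be supplied from tables; the sketch only evaluates the statistic $[(1 - V_i)/V_i]\cdot(n-i+1)/m$ in sequence and the overall level $1 - \prod_i (1 - \varepsilon_i)$.

```python
def stepdown_level(eps):
    """Overall significance level 1 - prod(1 - eps_i) of the step-down test."""
    accept = 1.0
    for e in eps:
        accept *= (1.0 - e)
    return 1.0 - accept


def equal_component_level(alpha, p):
    """Common component level eps giving overall level alpha with p components."""
    return 1.0 - (1.0 - alpha) ** (1.0 / p)


def stepdown_test(V, m, n, f_points):
    """Run the step-down sequence.

    V        : [V_1, ..., V_p], the beta factors of U
    f_points : [F_{m,n}(eps_1), F_{m,n-1}(eps_2), ...], taken from tables
    Returns the 1-based index of the first rejected component, or None
    if every component hypothesis is accepted.
    """
    for i, (v, crit) in enumerate(zip(V, f_points), start=1):
        stat = (1.0 - v) / v * (n - i + 1) / m
        if stat > crit:
            return i
    return None


if __name__ == "__main__":
    print(stepdown_level([0.01, 0.02]))
    # Illustrative values only: two factors and two assumed F points.
    print(stepdown_test([0.9, 0.2], m=2, n=10, f_points=[4.10, 4.26]))
```

Choosing equal component levels through `equal_component_level` is one simple design; unequal levels would weight the earlier, more important variables differently.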

In the step-down procedure the investigator usually has a choice of the
ordering of the variables† (i.e., the numbering of the components of $X$) and a
selection of component significance levels. It seems reasonable to order the
variables in descending order of importance. The choice of significance levels
will affect the power. If $\varepsilon_i$ is a very small number, it will take a correspondingly
large deviation from the $i$th null hypothesis to lead to rejection. In the
absence of any other reason, the component significance levels can be taken
equal. This procedure, of course, is not invariant with respect to linear
transformation of the dependent vector variable. However, before carrying
out a step-down procedure, a linear transformation can be used to determine
the $p$ variables.

The factors can be grouped. For example, group $V_1, \dots, V_k$ into one set
and $V_{k+1}, \dots, V_p$ into another set. Then $U_{k,m,n} = \prod_{i=1}^k V_i$ can be used to test
the null hypothesis that the first $k$ rows of $\beta_1$ are the first $k$ rows of $\beta_1^*$.
Subsequently $\prod_{i=k+1}^p V_i$ is used to test the hypothesis that the last $p - k$ rows
of $\beta_1$ are those of $\beta_1^*$; this latter criterion has the distribution under the null
hypothesis of $U_{p-k,m,n-k}$.

† In some cases the ordering of variables may be imposed; for example, $x_1$ might be an
observation at the first time point, $x_2$ at the second time point, and so on.
The investigator may test the null hypothesis $\beta_1 = \beta_1^*$ by the likelihood
ratio procedure. If the hypothesis is rejected, he may look at the factors
$V_1, \dots, V_p$ to try to determine which rows of $\beta_1$ might be different from $\beta_1^*$.
The factors can also be used to obtain confidence regions for $\beta_{11}, \dots, \beta_{p1}$.
Let $v_i(\varepsilon_i)$ be defined by

(44)  $$\frac{1 - v_i(\varepsilon_i)}{v_i(\varepsilon_i)}\cdot\frac{n-i+1}{m} = F_{m,n-i+1}(\varepsilon_i).$$

Then a confidence region for $\beta_{i1}$ of confidence $1 - \varepsilon_i$ is


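(45)  $$\frac{\begin{vmatrix} x_i^*x_i^{*\prime} & x_i^*X_{i-1}' & x_i^*Z' \\ X_{i-1}x_i^{*\prime} & X_{i-1}X_{i-1}' & X_{i-1}Z' \\ Zx_i^{*\prime} & ZX_{i-1}' & ZZ' \end{vmatrix}}{\begin{vmatrix} (x_i^*-\beta_{i1}Z_1)(x_i^*-\beta_{i1}Z_1)' & (x_i^*-\beta_{i1}Z_1)X_{i-1}' & (x_i^*-\beta_{i1}Z_1)Z_2' \\ X_{i-1}(x_i^*-\beta_{i1}Z_1)' & X_{i-1}X_{i-1}' & X_{i-1}Z_2' \\ Z_2(x_i^*-\beta_{i1}Z_1)' & Z_2X_{i-1}' & Z_2Z_2' \end{vmatrix}} \cdot \frac{\begin{vmatrix} X_{i-1}X_{i-1}' & X_{i-1}Z_2' \\ Z_2X_{i-1}' & Z_2Z_2' \end{vmatrix}}{\begin{vmatrix} X_{i-1}X_{i-1}' & X_{i-1}Z' \\ ZX_{i-1}' & ZZ' \end{vmatrix}} \ge v_i(\varepsilon_i).$$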


8.5. AN ASYMPTOTIC EXPANSION OF THE DISTRIBUTION
OF THE LIKELIHOOD RATIO CRITERION


8.5.1. General Theory of Asymptotic Expansions

In this section we develop a large-sample distribution theory for the criterion
studied in this chapter. First we develop a general asymptotic expansion of
the distribution of a random variable whose moments are certain functions of
gamma functions [Box (1949)]. Then we apply it to the case of the likelihood
ratio criterion for the linear hypothesis.

We consider a random variable $W$ ($0 \le W \le 1$) with $h$th moment†

(1)  $$\mathcal{E}W^h = K\left(\frac{\prod_{j=1}^b y_j^{y_j}}{\prod_{k=1}^a x_k^{x_k}}\right)^{h} \frac{\prod_{k=1}^a \Gamma[x_k(1+h)+\xi_k]}{\prod_{j=1}^b \Gamma[y_j(1+h)+\eta_j]}, \qquad h = 0, 1, \dots,$$

† In all cases where we apply this result, the parameters $x_k$, $y_j$, $\xi_k$, and $\eta_j$ will be such that there
is a distribution with such moments.
where $K$ is a constant such that $\mathcal{E}W^0 = 1$ and

(2)  $$\sum_{k=1}^a x_k = \sum_{j=1}^b y_j.$$

It will be observed that the $h$th moment of $\lambda = U_{p,q_1,n}^{\frac12 N}$ is of this form
where $x_k = \frac12 N = y_j$, $\xi_k = \frac12(-q+1-k)$, $\eta_j = \frac12(-q_2+1-j)$, $a = b = p$. We
treat a more general case here because applications later in this book require
it.

If we let

(3)  $$M = -2\log W,$$

the characteristic function of $\rho M$ ($0 \le \rho < 1$) is

(4)  $$\phi(t) = \mathcal{E}e^{it\rho M} = \mathcal{E}W^{-2it\rho} = K\left(\frac{\prod_{j=1}^b y_j^{y_j}}{\prod_{k=1}^a x_k^{x_k}}\right)^{-2it\rho} \frac{\prod_{k=1}^a \Gamma[x_k(1-2it\rho)+\xi_k]}{\prod_{j=1}^b \Gamma[y_j(1-2it\rho)+\eta_j]}.$$

Here $\rho$ is arbitrary; later it will depend on $N$. If $a = b$, $x_k = y_k$, $\xi_k \le \eta_k$, then
(1) is the $h$th moment of the product of powers of variables with beta
distributions, and then (1) holds for all $h$ for which the gamma functions
exist. In this case (4) is valid for all real $t$. We shall assume here that (4) holds
for all real $t$, and in each case where we apply the result we shall verify this
assumption.

Let

(5)  $$\Phi(t) = \log\phi(t) = g(t) - g(0),$$

where

$$g(t) = 2it\rho\left[\sum_{k=1}^a x_k\log x_k - \sum_{j=1}^b y_j\log y_j\right] + \sum_{k=1}^a \log\Gamma[\rho x_k(1-2it)+\beta_k+\xi_k] - \sum_{j=1}^b \log\Gamma[\rho y_j(1-2it)+\varepsilon_j+\eta_j],$$

where $\beta_k = (1-\rho)x_k$ and $\varepsilon_j = (1-\rho)y_j$. The form $g(t) - g(0)$ makes $\Phi(0) =
0$, which agrees with the fact that $K$ is such that $\phi(0) = 1$. We make use of an
expansion formula for the gamma function [Barnes (1899), p. 64] which is
asymptotic in $x$ for bounded $h$:

(6)  $$\log\Gamma(x+h) = \log\sqrt{2\pi} + (x+h-\tfrac12)\log x - x - \sum_{r=1}^m \frac{(-1)^r B_{r+1}(h)}{r(r+1)x^r} + R_{m+1}(x),$$

where† $R_{m+1}(x) = O(x^{-(m+1)})$ and $B_r(h)$ is the Bernoulli polynomial of
degree $r$ and order unity defined by‡

(7)  $$\frac{\tau e^{h\tau}}{e^\tau - 1} = \sum_{r=0}^\infty \frac{\tau^r}{r!}B_r(h).$$

The first three polynomials are [$B_0(h) = 1$]

(8)  $$B_1(h) = h - \tfrac12, \qquad B_2(h) = h^2 - h + \tfrac16, \qquad B_3(h) = h^3 - \tfrac32 h^2 + \tfrac12 h.$$
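The Bernoulli polynomials used in the expansion can be generated exactly with rational arithmetic. The sketch below is an illustration with hypothetical function names; it assumes the standard identity $B_r(h) = \sum_{k=0}^r \binom{r}{k} B_k h^{r-k}$ with $B_k = B_k(0)$ and $B_1 = -\frac12$, which follows from the generating function, and it reproduces the three polynomials listed above.

```python
from fractions import Fraction
from math import comb


def bernoulli_numbers(n_max):
    """Return [B_0, ..., B_n_max] with the convention B_1 = -1/2, i.e. B_r = B_r(0)."""
    B = [Fraction(1)]
    for m in range(1, n_max + 1):
        # Recurrence: sum_{k=0}^{m} C(m+1, k) B_k = 0 for m >= 1.
        s = sum(comb(m + 1, k) * B[k] for k in range(m))
        B.append(-s / (m + 1))
    return B


def bernoulli_poly(r, h):
    """Evaluate the Bernoulli polynomial B_r(h) exactly for rational h."""
    B = bernoulli_numbers(r)
    h = Fraction(h)
    return sum(comb(r, k) * B[k] * h ** (r - k) for k in range(r + 1))


if __name__ == "__main__":
    h = Fraction(1, 3)
    print(bernoulli_poly(2, h), h * h - h + Fraction(1, 6))
```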

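Taking $x = \rho x_k(1-2it)$, $\rho y_j(1-2it)$ and $h = \beta_k + \xi_k$, $\varepsilon_j + \eta_j$ in turn, we
obtain

(9)  $$\Phi(t) = Q - g(0) - \tfrac12 f\log(1-2it) + \sum_{r=1}^m \omega_r(1-2it)^{-r} + \sum_{k=1}^a O(x_k^{-(m+1)}) + \sum_{j=1}^b O(y_j^{-(m+1)}),$$

where

(10)  $$f = -2\left[\sum_{k=1}^a \xi_k - \sum_{j=1}^b \eta_j - \tfrac12(a-b)\right],$$

(11)  $$\omega_r = \frac{(-1)^{r+1}}{r(r+1)}\left\{\sum_{k=1}^a \frac{B_{r+1}(\beta_k+\xi_k)}{(\rho x_k)^r} - \sum_{j=1}^b \frac{B_{r+1}(\varepsilon_j+\eta_j)}{(\rho y_j)^r}\right\},$$

(12)  $$Q = \tfrac12(a-b)\log 2\pi - \tfrac12 f\log\rho + \sum_{k=1}^a (x_k+\xi_k-\tfrac12)\log x_k - \sum_{j=1}^b (y_j+\eta_j-\tfrac12)\log y_j.$$

† $R_{m+1}(x) = O(x^{-(m+1)})$ means $|x^{m+1}R_{m+1}(x)|$ is bounded as $|x| \to \infty$.

‡ This definition differs slightly from that of Whittaker and Watson [(1943), p. 126], who expand
$\tau(e^{h\tau} - 1)/(e^\tau - 1)$. If $B_r^*(h)$ is this second type of polynomial, $B_1(h) = B_1^*(h) - \frac12$,
$B_{2r}(h) = B_{2r}^*(h) + (-1)^{r+1}B_r$, where $B_r$ is the $r$th Bernoulli number, and
$B_{2r+1}(h) = B_{2r+1}^*(h)$.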
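One resulting form for $\phi(t)$ (which we shall not use here) is

(13)  $$\phi(t) = e^{\Phi(t)} = e^{Q-g(0)}(1-2it)^{-\frac12 f}\sum_{v=0}^m a_v(1-2it)^{-v} + R_{m+1}^*,$$

where $\sum_{v=0}^m a_v z^{-v}$ is the sum of the first $m + 1$ terms in the series expansion
of $\exp(\sum_{r=1}^m \omega_r z^{-r})$, and $R_{m+1}^*$ is a remainder term. Alternatively,

(14)  $$\Phi(t) = -\tfrac12 f\log(1-2it) + \sum_{r=1}^m \omega_r\left[(1-2it)^{-r} - 1\right] + R_{m+1}',$$

where

(15)  $$R_{m+1}' = \sum_{k} O(x_k^{-(m+1)}) + \sum_{j} O(y_j^{-(m+1)}).$$

In (14) we have expanded $g(0)$ in the same way we expanded $g(t)$ and have
collected similar terms.

Then

(16)  $$\begin{aligned}
\phi(t) = e^{\Phi(t)}
&= (1-2it)^{-\frac12 f}\exp\left[\sum_{r=1}^m \omega_r(1-2it)^{-r} - \sum_{r=1}^m \omega_r + R_{m+1}'\right] \\
&= (1-2it)^{-\frac12 f}\prod_{r=1}^m \left\{1 + \omega_r(1-2it)^{-r} + \frac{1}{2!}\omega_r^2(1-2it)^{-2r} + \cdots\right\} \\
&\qquad\times \prod_{r=1}^m \left(1 - \omega_r + \frac{1}{2!}\omega_r^2 - \cdots\right) + R_{m+1}'' \\
&= (1-2it)^{-\frac12 f}\left[1 + T_1(t) + T_2(t) + \cdots + T_m(t) + R_{m+1}'''\right],
\end{aligned}$$

where $T_r(t)$ is the term in the expansion with terms $\omega_1^{s_1}\cdots\omega_r^{s_r}$, $\sum_i i s_i = r$; for
example,

(17)  $$T_1(t) = \omega_1\left[(1-2it)^{-1} - 1\right],$$

(18)  $$T_2(t) = \omega_2\left[(1-2it)^{-2} - 1\right] + \tfrac12\omega_1^2\left[(1-2it)^{-2} - 2(1-2it)^{-1} + 1\right].$$

In most applications, we will have $x_k = c_k\theta$ and $y_j = d_j\theta$, where $c_k$ and $d_j$
will be constant and $\theta$ will vary (i.e., will grow with the sample size). In this
case if $\rho$ is chosen so $(1-\rho)x_k$ and $(1-\rho)y_j$ have limits, then $R_{m+1}'''$ is
$O(\theta^{-(m+1)})$. We collect in (16) all terms $\omega_1^{s_1}\cdots\omega_r^{s_r}$, $\sum_i i s_i = r$, because these
terms are $O(\theta^{-r})$.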
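It will be observed that $T_r(t)$ is a polynomial of degree $r$ in $(1-2it)^{-1}$ and
each term of $(1-2it)^{-\frac12 f}T_r(t)$ is a constant times $(1-2it)^{-\frac12 v}$ for an integral
$v$. We know that $(1-2it)^{-\frac12 v}$ is the characteristic function of the $\chi^2$-density
with $v$ degrees of freedom; that is,

(19)  $$g_v(z) = \frac{1}{2^{\frac12 v}\Gamma(\frac12 v)}\,z^{\frac12 v-1}e^{-\frac12 z} = \int_{-\infty}^{\infty} \frac{1}{2\pi}(1-2it)^{-\frac12 v}e^{-itz}\,dt.$$

Let

(20)  $$S_r(z) = \int_{-\infty}^{\infty} \frac{1}{2\pi}(1-2it)^{-\frac12 f}\,T_r(t)\,e^{-itz}\,dt, \qquad R_{m+1}^{\mathrm{iv}} = \int_{-\infty}^{\infty} \frac{1}{2\pi}(1-2it)^{-\frac12 f}\,R_{m+1}'''\,e^{-itz}\,dt.$$

Then the density of $\rho M$ is

(21)  $$\begin{aligned}
\int_{-\infty}^{\infty} \frac{1}{2\pi}\phi(t)e^{-itz}\,dt
&= \sum_{r=0}^m S_r(z) + R_{m+1}^{\mathrm{iv}} \\
&= g_f(z) + \omega_1\left[g_{f+2}(z) - g_f(z)\right] \\
&\quad + \left\{\omega_2\left[g_{f+4}(z) - g_f(z)\right] + \frac{\omega_1^2}{2}\left[g_{f+4}(z) - 2g_{f+2}(z) + g_f(z)\right]\right\} \\
&\quad + \cdots + S_m(z) + R_{m+1}^{\mathrm{iv}}.
\end{aligned}$$

Let

(22)  $$U_r(z_0) = \int_0^{z_0} S_r(z)\,dz, \qquad R_{m+1}^{\mathrm{v}} = \int_0^{z_0} R_{m+1}^{\mathrm{iv}}\,dz.$$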


The cdf of $M$ is written in terms of the cdf of $\rho M$, which is the integral of
the density, namely,

(23)  $$\begin{aligned}
\Pr\{M \le M_0\}
&= \Pr\{\rho M \le \rho M_0\} = \sum_{r=0}^m U_r(\rho M_0) + R_{m+1}^{\mathrm{v}} \\
&= \Pr\{\chi_f^2 \le \rho M_0\} + \omega_1\left(\Pr\{\chi_{f+2}^2 \le \rho M_0\} - \Pr\{\chi_f^2 \le \rho M_0\}\right) \\
&\quad + \Bigl[\omega_2\left(\Pr\{\chi_{f+4}^2 \le \rho M_0\} - \Pr\{\chi_f^2 \le \rho M_0\}\right) \\
&\qquad + \frac{\omega_1^2}{2}\left(\Pr\{\chi_{f+4}^2 \le \rho M_0\} - 2\Pr\{\chi_{f+2}^2 \le \rho M_0\} + \Pr\{\chi_f^2 \le \rho M_0\}\right)\Bigr] \\
&\quad + \cdots + U_m(\rho M_0) + R_{m+1}^{\mathrm{v}}.
\end{aligned}$$

The remainder $R_{m+1}^{\mathrm{v}}$ is $O(\theta^{-(m+1)})$; this last statement can be verified by
following the remainder terms along. (In fact, to make the proof rigorous one
needs to verify that each remainder is of the proper order in a uniform
sense.)

In many cases it is desirable to choose $\rho$ so that $\omega_1 = 0$. In such a case
using only the first term of (23) gives an error of order $\theta^{-2}$.

Further details of the expansion can be found in Box's paper (1949).

Theorem 8.5.1. Suppose that $\mathcal{E}W^h$ is given by (1) for all purely imaginary
$h$, with (2) holding. Then the cdf of $-2\rho\log W$ is given by (23). The error,
$R_{m+1}^{\mathrm{v}}$, is $O(\theta^{-(m+1)})$ if $x_k \ge c_k\theta$, $y_j \ge d_j\theta$ ($c_k > 0$, $d_j > 0$), and if $(1-\rho)x_k$,
$(1-\rho)y_j$ have limits, where $\rho$ may depend on $\theta$.

Box also considers approximating the distribution of $-2\rho\log W$ by an
$F$-distribution. He finds that the error in this approximation can be made to
be of order $\theta^{-3}$.


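8.5.2. Asymptotic Distribution of the Likelihood Ratio Criterion

We now apply Theorem 8.5.1 to the distribution of $-2\log\lambda$, the likelihood
ratio criterion developed in Section 8.3. We let $W = \lambda$. The $h$th moment of $\lambda$
is

(24)  $$\mathcal{E}\lambda^h = K\prod_{k=1}^p \frac{\Gamma[\frac12(N-q+1-k+Nh)]}{\Gamma[\frac12(N-q_2+1-k+Nh)]},$$

and this holds for all $h$ for which the gamma functions exist, including purely
imaginary $h$. We let $a = b = p$,

(25)  $$x_k = \tfrac12 N, \quad \xi_k = \tfrac12(-q+1-k), \quad \beta_k = \tfrac12(1-\rho)N, \qquad y_j = \tfrac12 N, \quad \eta_j = \tfrac12(-q_2+1-j), \quad \varepsilon_j = \tfrac12(1-\rho)N.$$

We observe that

(26)  $$\begin{aligned}
2\omega_1
&= \sum_{k=1}^p \left\{\frac{\{\frac12[(1-\rho)N-q+1-k]\}^2 - \frac12[(1-\rho)N-q+1-k]}{\frac12\rho N} - \frac{\{\frac12[(1-\rho)N-q_2+1-k]\}^2 - \frac12[(1-\rho)N-q_2+1-k]}{\frac12\rho N}\right\} \\
&= \frac{2}{\rho N}\sum_{k=1}^p \left\{\frac{-2[(1-\rho)N-q_2+1-k]\,q_1 + q_1^2}{4} + \frac{q_1}{2}\right\} \\
&= \frac{pq_1}{2\rho N}\left[-2(1-\rho)N + 2q_2 - 2 + (p+1) + q_1 + 2\right].
\end{aligned}$$

To make this zero, we require that

(27)  $$\rho = \frac{N - q_2 - \frac12(p+q_1+1)}{N}.$$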

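Then

(28)  $$\begin{aligned}
\Pr\Bigl\{-2\frac{k}{N}\log\lambda \le z\Bigr\}
&= \Pr\{-k\log U_{p,q_1,N-q} \le z\} \\
&= \Pr\{\chi_{pq_1}^2 \le z\} + \frac{\gamma_2}{k^2}\left(\Pr\{\chi_{pq_1+4}^2 \le z\} - \Pr\{\chi_{pq_1}^2 \le z\}\right) \\
&\quad + \frac{1}{k^4}\left[\gamma_4\left(\Pr\{\chi_{pq_1+8}^2 \le z\} - \Pr\{\chi_{pq_1}^2 \le z\}\right) - \gamma_2^2\left(\Pr\{\chi_{pq_1+4}^2 \le z\} - \Pr\{\chi_{pq_1}^2 \le z\}\right)\right] + R_5^{\mathrm{v}},
\end{aligned}$$

where

(29)  $$k = \rho N = N - q_2 - \tfrac12(p+q_1+1) = n - \tfrac12(p-q_1+1),$$

(30)  $$\gamma_2 = \frac{pq_1(p^2+q_1^2-5)}{48},$$

(31)  $$\gamma_4 = \frac{\gamma_2^2}{2} + \frac{pq_1}{1920}\left[3p^4 + 3q_1^4 + 10p^2q_1^2 - 50(p^2+q_1^2) + 159\right].$$

Since $\lambda = U_{p,q_1,n}^{\frac12 N}$, where $n = N - q$, (28) gives $\Pr\{-k\log U_{p,q_1,n} \le z\}$.

Theorem 8.5.2. The cdf of $-k\log U_{p,q_1,n}$ is given by (28) with $k = n - \frac12(p-q_1+1)$,
and $\gamma_2$ and $\gamma_4$ given by (30) and (31), respectively. The
remainder term is $O(N^{-6})$.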

The coefficient $k = n - \frac12(p-q_1+1)$ is known as the Bartlett correction.
If the first term of (28) is used, the error is of the order $N^{-2}$; if the second,
$N^{-4}$; and if the third,† $N^{-6}$. The second term is always negative and is
numerically maximum for $z = \sqrt{(pq_1+2)(pq_1)}$ ($= pq_1 + 1$, approximately).
For $p \ge 3$, $q_1 \ge 3$, we have $\gamma_2/k^2 \le [(p^2+q_1^2)/k]^2/96$, and the contribution
of the second term lies between $-0.005[(p^2+q_1^2)/k]^2$ and 0. For $p \ge 3$,
$q_1 \ge 3$, we have $\gamma_4 \le \gamma_2^2$, and the contribution of the third term is numerically
less than $(\gamma_2/k^2)^2$. A rough rule that may be followed is that use of the first
term is accurate to three decimal places if $p^2 + q_1^2 \le k/3$.

As an example of the calculation, consider the case of $p = 3$, $q_1 = 6$,
$N - q_2 = 24$, and $z = 26.0$ (the 10% significance point of $\chi_{18}^2$). In this case
$\gamma_2/k^2 = 0.048$ and the second term is $-0.007$; $\gamma_4/k^4 = 0.0015$ and the third
term is $-0.0001$. Thus the probability of $-19\log U_{3,6,18} \le 26.0$ is 0.893 to
three decimal places.
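The three terms of the expansion (28) can be evaluated with the standard library alone. The sketch below is illustrative: the function names are hypothetical, and the chi-square cdf is computed from a power series for the regularized incomplete gamma function, a choice made here to avoid external dependencies. For the case in the example ($p = 3$, $q_1 = 6$, $n = 18$) formulas (29) and (30) give $k = 19$ and $\gamma_2 = 15$; the computed terms can be compared with the rounded figures quoted above.

```python
import math


def chi2_cdf(x, df):
    """Pr{chi-square with df degrees of freedom <= x}, via the series
    P(a, z) = z^a e^{-z} / Gamma(a) * sum_k z^k / (a (a+1) ... (a+k))."""
    if x <= 0.0:
        return 0.0
    a, z = 0.5 * df, 0.5 * x
    term = 1.0 / a
    total = term
    k = 0
    while term > 1e-16 * total:
        k += 1
        term *= z / (a + k)
        total += term
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))


def box_terms(z, p, q1, n):
    """Return (k, first, second, third): the Bartlett factor (29) and the
    three leading terms of (28) for Pr{-k log U_{p,q1,n} <= z}."""
    f = p * q1
    k = n - 0.5 * (p - q1 + 1)
    g2 = p * q1 * (p ** 2 + q1 ** 2 - 5) / 48.0
    g4 = g2 ** 2 / 2.0 + p * q1 / 1920.0 * (
        3 * p ** 4 + 3 * q1 ** 4 + 10 * p ** 2 * q1 ** 2
        - 50 * (p ** 2 + q1 ** 2) + 159)
    c0, c4, c8 = chi2_cdf(z, f), chi2_cdf(z, f + 4), chi2_cdf(z, f + 8)
    first = c0
    second = g2 / k ** 2 * (c4 - c0)
    third = (g4 * (c8 - c0) - g2 ** 2 * (c4 - c0)) / k ** 4
    return k, first, second, third


if __name__ == "__main__":
    k, t1, t2, t3 = box_terms(26.0, 3, 6, 18)
    print(k, t1, t2, t3, t1 + t2 + t3)
```

The series for the incomplete gamma function converges for every positive argument, although more slowly for large $z$; a library routine could be substituted where one is available.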

Since

(32)  $$-[n - \tfrac12(p-m+1)]\log u_{p,m,n}(\alpha) = C_{p,m,n-p+1}(\alpha)\,\chi_{pm}^2(\alpha),$$

the proportional error in approximating the left-hand side by $\chi_{pm}^2(\alpha)$ is
$C_{p,m,n-p+1}(\alpha) - 1$. The proportional error increases slowly with $p$ and $m$.

8.5.3. A Normal Approximation

Mudholkar and Trivedi (1980), (1981) developed a normal approximation to
the distribution of $-\log U_{p,m,n}$ which is asymptotic as $p$ and/or $m \to \infty$. It is
related to the Wilson-Hilferty normal approximation for the $\chi^2$-distribution.

† Box has shown that the term of order $N^{-5}$ is 0 and gives the coefficients to be used in the term
of order $N^{-6}$.
First, we give the background of the approximation. Suppose $\{Y_k\}$ is a
sequence of nonnegative random variables such that $(Y_k - \mu_k)/\sigma_k \xrightarrow{d} N(0,1)$
as $k \to \infty$, where $\mathcal{E}Y_k = \mu_k$ and $\operatorname{Var}(Y_k) = \sigma_k^2$. Suppose also that $\mu_k \to \infty$ and
$\sigma_k^2/\mu_k$ is bounded as $k \to \infty$. Let $Z_k = (Y_k/\mu_k)^h$. Then

(33)  $$\frac{Z_k - 1}{h(\sigma_k/\mu_k)} = \frac{\mu_k(Z_k - 1)}{h\sigma_k} \xrightarrow{d} N(0,1)$$

by Theorem 4.2.3. The approach to normality may be accelerated by choosing
$h$ to make the distribution of $Z_k$ nearly symmetric as measured by its third
cumulant. The normal distribution is to be used as an approximation and is
justified by its accuracy in practice. However, it will be convenient to develop
the ideas in terms of limits, although rigor is not necessary.

By a Taylor expansion we express the $h$th moment of $Y_k/\mu_k$ as

(34)  $$\mathcal{E}Z_k = \mathcal{E}\left(\frac{Y_k}{\mu_k}\right)^h = 1 + \frac{h(h-1)\sigma_k^2}{2\mu_k^2} + \frac{h(h-1)(h-2)\left[4\phi_k + 3(h-3)(\sigma_k^2/\mu_k)^2\right]}{24\mu_k^2} + O(\mu_k^{-3}),$$

where $\phi_k = \mathcal{E}(Y_k - \mu_k)^3/\mu_k$, assumed bounded. The $r$th moment of $Z_k$ is
expressed by replacement of $h$ by $rh$ in (34). The central moments of $Z_k$ are

(35)  $$\mathcal{E}(Z_k - \mathcal{E}Z_k)^2 = \frac{h^2\sigma_k^2}{\mu_k^2} + \frac{h^2(h-1)\left[2\phi_k + (3h-5)(\sigma_k^2/\mu_k)^2\right]}{2\mu_k^2} + O(\mu_k^{-3}),$$

(36)  $$\mathcal{E}(Z_k - \mathcal{E}Z_k)^3 = \frac{h^3\left[\phi_k + 3(h-1)(\sigma_k^2/\mu_k)^2\right]}{\mu_k^2} + O(\mu_k^{-3}).$$

To make the third moment approximately 0 we take $h$ to be

(37)  $$h_0 = 1 - \frac{1}{3}\,\frac{\mathcal{E}(Y_k - \mu_k)^3\,\mu_k}{\sigma_k^4}.$$

Then $Z_k = (Y_k/\mu_k)^{h_0}$ is treated as normally distributed with mean and
variance given by (34) and (35), respectively, with $h = h_0$.
Now we consider $-\log U_{p,m,n} = -\sum_{i=1}^p \log V_i$, where $V_1, \dots, V_p$ are independent
and $V_i$ has the density $\beta(x; (n+1-i)/2, m/2)$, $i = 1, \dots, p$. As
$n \to \infty$ and $m \to \infty$, $-\log V_i$ tends to normality. If $V$ has the density
$\beta(x; a/2, b/2)$, the moment generating function of $-\log V$ is

(38)  $$\mathcal{E}e^{-t\log V} = \frac{\Gamma[(a+b)/2]\,\Gamma(a/2 - t)}{\Gamma(a/2)\,\Gamma[(a+b)/2 - t]}.$$

Its logarithm is the cumulant generating function. Differentiation of the last
yields as the $r$th cumulant of $-\log V$

(39)  $$C_r = (-1)^r\left\{\psi^{(r-1)}(a/2) - \psi^{(r-1)}[(a+b)/2]\right\}, \qquad r = 1, 2, \dots,$$

where $\psi(w) = d\log\Gamma(w)/dw$. [See Abramowitz and Stegun (1972), p. 258, for
example.] From $\Gamma(w+1) = w\Gamma(w)$ we obtain the recursion relation $\psi(w+1)
= \psi(w) + 1/w$. This yields for $s = 0$ and $l$ an integer

(40)  $$\psi^{(s)}(w) - \psi^{(s)}(w+l) = (-1)^{s+1}s!\sum_{j=0}^{l-1}\frac{1}{(w+j)^{s+1}}.$$

The validity of (40) for $s = 1, 2, \dots$ is verified by differentiation. [The expression
for $\psi'(z)$ in the first line of page 223 of Mudholkar and Trivedi (1981) is
incorrect.] Thus for $b = 2l$

(41)  $$C_r = (r-1)!\sum_{j=0}^{l-1}\frac{1}{(a/2+j)^r}.$$

From these results we obtain as the $r$th cumulant of $-\log U_{p,2l,n}$

(42)  $$\kappa_r(-\log U_{p,2l,n}) = 2^r(r-1)!\sum_{i=1}^p\sum_{j=0}^{l-1}\frac{1}{(n-i+1+2j)^r}.$$

As $l \to \infty$ the series diverges for $r = 1$ and converges for $r = 2, 3$, and hence
$\kappa_r/\kappa_1 \to 0$, $r = 2, 3$. The same is true as $p \to \infty$ (if $n/p$ approaches a positive
constant).

Given $n$, $p$, and $l$, the first three cumulants are calculated from (42). Then
$h_0$ is determined from (37), and $(-\log U_{p,2l,n})^{h_0}$ is treated as approximately
normally distributed with mean and variance calculated from (34) and (35)
for $h = h_0$.
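The first steps of this recipe, the cumulants (42) and the exponent $h_0$ of (37), are simple finite sums. The sketch below is illustrative only; the function names are hypothetical, and $h_0$ is computed by identifying $\mu$, $\sigma^2$, and the third central moment with the first three cumulants.

```python
import math


def cumulant_neg_log_U(r, p, l, n):
    """r-th cumulant of -log U_{p,2l,n} from the double sum (42)."""
    s = sum(1.0 / (n - i + 1 + 2 * j) ** r
            for i in range(1, p + 1) for j in range(l))
    return 2 ** r * math.factorial(r - 1) * s


def h0_from_cumulants(k1, k2, k3):
    """Exponent (37): h0 = 1 - k1 * k3 / (3 * k2^2), with k1 the mean,
    k2 the variance, and k3 the third central moment."""
    return 1.0 - k1 * k3 / (3.0 * k2 ** 2)


if __name__ == "__main__":
    p, l, n = 3, 2, 10   # illustrative values, with m = 2l = 4
    k1, k2, k3 = (cumulant_neg_log_U(r, p, l, n) for r in (1, 2, 3))
    print(k1, k2, k3, h0_from_cumulants(k1, k2, k3))
```

As a check on the formula for $h_0$, a $\chi^2$ variable with $\nu$ degrees of freedom has cumulants $\nu$, $2\nu$, $8\nu$, which gives $h_0 = \frac13$, the Wilson-Hilferty cube-root transformation.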

Mudholkar and Trivedi (1980) calculated the error of approximation for
significance levels of 0.01 and 0.05 for $n$ from 4 to 66, $p = 3, 7$, and
$q = 2, 6, 10$. The maximum error is less than 0.0007; in most cases the error is
considerably less. The error for the $\chi^2$-approximation is much larger, especially
for small values of $n$.

In case of $m$ odd the $r$th cumulant can be approximated by

(43)  $$2^r(r-1)!\sum_{i=1}^p\left[\sum_{j=0}^{\frac12(m-3)}\frac{1}{(n-i+1+2j)^r} + \frac{1}{2}\,\frac{1}{(n-i+m)^r}\right].$$

Davis (1933, 1935) gave tables of $\psi(w)$ and its derivatives.


8,5.4 An F-Approximation 

Rao (1951) has used the expansion of Section 8.5.2 to develop an expansion
of the distribution of another function of $U_{p,m,n}$ in terms of beta
distributions. The constants can be adjusted so that the term after the leading
one is of order $m^{-4}$. A good approximation is to consider

(44)  $$\frac{1 - U^{1/s}}{U^{1/s}} \cdot \frac{ks - r}{pm}$$

as $F$ with $pm$ and $ks - r$ degrees of freedom, where

(45)  $$s = \sqrt{\frac{p^2 m^2 - 4}{p^2 + m^2 - 5}}, \qquad r = \frac{pm}{2} - 1,$$

and $k$ is $n - \frac{1}{2}(p - m + 1)$. For $p = 1$ or 2 or $m = 1$ or 2 the $F$-distribution is
exactly as given in Section 8.4. If $ks - r$ is not an integer, interpolation
between two integer values can be used. For smaller values of $m$ this
approximation is more accurate than the $\chi^2$-approximation.
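The arithmetic of this approximation can be sketched as follows. The sketch assumes the usual form of Rao's constants, $s = \sqrt{(p^2m^2 - 4)/(p^2 + m^2 - 5)}$ and $r = pm/2 - 1$, with $k = n - \frac{1}{2}(p - m + 1)$; setting $s = 1$ when $p^2 + m^2 - 5 = 0$ is a convention adopted here, and the function name is hypothetical.

```python
from math import sqrt


def rao_f(u, p, m, n):
    """Return (F, df1, df2) for the F-approximation to U_{p,m,n}.

    Assumes s = sqrt((p^2 m^2 - 4) / (p^2 + m^2 - 5)), r = pm/2 - 1,
    and k = n - (p - m + 1)/2.
    """
    pm = p * m
    denom = p * p + m * m - 5
    # When p^2 + m^2 = 5 the ratio is 0/0; s = 1 is used by convention (assumption).
    s = sqrt((pm * pm - 4) / denom) if denom != 0 else 1.0
    r = pm / 2 - 1
    k = n - (p - m + 1) / 2
    root = u ** (1.0 / s)              # U^{1/s}
    df1, df2 = pm, k * s - r
    return (1 - root) / root * df2 / df1, df1, df2


# Illustrative call; the F value would be referred to an F table with
# df1 and df2 degrees of freedom (interpolating if df2 is not an integer).
f_value, df1, df2 = rao_f(0.82, p=4, m=3, n=394)
print(f_value, df1, df2)
```

For $p = 1$ the sketch reduces to $\{(1 - U)/U\}\, n/m$ with $m$ and $n$ degrees of freedom, which is the exact result.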


8.6. OTHER CRITERIA FOR TESTING THE LINEAR HYPOTHESIS
8.6.1. Functions of Roots

Thus far the only test of the linear hypothesis we have considered is the 
likelihood ratio test. In this section we consider other test procedures. 

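Let $\hat\Sigma_\Omega$, $\hat\beta_{1\Omega}$, and $\hat\beta_{2\omega}$ be the estimates of the parameters in $N(\beta z, \Sigma)$,
based on a sample of $N$ observations. These are a sufficient set of statistics,
and we shall base test procedures on them. As was shown in Section 8.3, if
the hypothesis is $\beta_1 = \beta_1^*$, one can reformulate the hypothesis as $\beta_1 = 0$ (by
replacing $x_\alpha$ by $x_\alpha - \beta_1^* z_\alpha^{(1)}$). Moreover,

(1)  $$\begin{aligned} \mathcal{E}x_\alpha &= \beta_1 z_\alpha^{(1)} + \beta_2 z_\alpha^{(2)} \\ &= \beta_1\bigl(z_\alpha^{(1)} - A_{12}A_{22}^{-1} z_\alpha^{(2)}\bigr) + \bigl(\beta_2 + \beta_1 A_{12}A_{22}^{-1}\bigr) z_\alpha^{(2)} \\ &= \beta_1 z_\alpha^{*(1)} + \beta_2^* z_\alpha^{(2)}, \end{aligned}$$

where $\sum_\alpha z_\alpha^{*(1)} z_\alpha^{(2)\prime} = 0$ and $\sum_\alpha z_\alpha^{*(1)} z_\alpha^{*(1)\prime} = A_{11\cdot 2}$. Then $\hat\beta_1 = \hat\beta_{1\Omega}$ and
$\hat\beta_2^* = \hat\beta_{2\omega}$.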

We shall use the principle of invariance to reduce the set of tests to be
considered. First, if we make the transformation $X_\alpha^* = X_\alpha + \Gamma z_\alpha^{(2)}$, we leave
the null hypothesis invariant, since $\mathcal{E}X_\alpha^* = \beta_1 z_\alpha^{*(1)} + (\beta_2^* + \Gamma) z_\alpha^{(2)}$ and $\beta_2^* + \Gamma$
is unspecified. The only invariants of the sufficient statistics are $\hat\Sigma$ and $\hat\beta_1$
(since for each $\hat\beta_2^*$, there is a $\Gamma$ that transforms it to 0, that is, $-\hat\beta_2^*$).
Second, the null hypothesis is invariant under the transformation $z_\alpha^{**(1)} =
C z_\alpha^{*(1)}$ ($C$ nonsingular); the transformation carries $\beta_1$ to $\beta_1 C^{-1}$. Under this
transformation $\hat\Sigma$ and $\hat\beta_1 A_{11\cdot 2}\hat\beta_1'$ are invariant; we consider $A_{11\cdot 2}$ as
information relevant to inference. However, these are the only invariants. For
consider a function of $\hat\beta_1$ and $A_{11\cdot 2}$, say $f(\hat\beta_1, A_{11\cdot 2})$. Then there is a $C^*$ that
carries this into $f(\hat\beta_1 C^{*-1}, I)$, and a further orthogonal transformation
carries this into $f(T, I)$, where $t_{iv} = 0$, $i < v$, $t_{ii} \ge 0$. (If each row of $T$ is
considered a vector in $q_1$-space, the rotation of coordinate axes can be done
so the first vector is along the first coordinate axis, the second vector is in the
plane determined by the first two coordinate axes, and so forth.) But $T$ is a
function of $TT' = \hat\beta_1 A_{11\cdot 2}\hat\beta_1'$; that is, the elements of $T$ are uniquely
determined by this equation and the preceding restrictions. Thus our tests will
depend on $\hat\Sigma$ and $\hat\beta_1 A_{11\cdot 2}\hat\beta_1'$. Let $N\hat\Sigma = G$ and $\hat\beta_1 A_{11\cdot 2}\hat\beta_1' = H$.

Third, the null hypothesis is invariant when $x_\alpha$ is replaced by $Kx_\alpha$, for $\Sigma$
and $\beta_2^*$ are unspecified. This transforms $G$ to $KGK'$ and $H$ to $KHK'$. The
only invariants of $G$ and $H$ under such transformations are the roots of

(2)  $$|H - lG| = 0.$$

It is clear the roots are invariant, for

(3)  $$0 = |KHK' - lKGK'| = |K(H - lG)K'| = |K| \cdot |H - lG| \cdot |K'|.$$


On the other hand, these are the only invariants, for given $G$ and $H$ there is
a $K$ such that $KGK' = I$ and

(4)  $$KHK' = L = \begin{pmatrix} l_1 & 0 & \cdots & 0 \\ 0 & l_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & l_p \end{pmatrix},$$

where $l_1 \ge \cdots \ge l_p$ are the roots of (2). (See Theorem A.2.2 of the Appendix.)
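The invariance of the roots can be checked numerically in the simplest case $p = 2$, where $|H - lG| = 0$ is the quadratic $|G|\,l^2 - (h_{11}g_{22} + h_{22}g_{11} - 2h_{12}g_{12})\,l + |H| = 0$. The sketch below is only an illustration; the matrices $G$, $H$, and $K$ are arbitrary made-up values.

```python
from math import sqrt


def roots_2x2(h, g):
    """Roots l1 >= l2 of |H - lG| = 0 for symmetric 2x2 H and positive definite G."""
    a = g[0][0] * g[1][1] - g[0][1] ** 2                                 # |G|
    b = h[0][0] * g[1][1] + h[1][1] * g[0][0] - 2 * h[0][1] * g[0][1]
    c = h[0][0] * h[1][1] - h[0][1] ** 2                                 # |H|
    disc = sqrt(max(b * b - 4 * a * c, 0.0))
    return (b + disc) / (2 * a), (b - disc) / (2 * a)


def congruence(k, m):
    """Return K M K' for 2x2 matrices stored as nested lists."""
    km = [[sum(k[i][t] * m[t][j] for t in range(2)) for j in range(2)]
          for i in range(2)]
    return [[sum(km[i][t] * k[j][t] for t in range(2)) for j in range(2)]
            for i in range(2)]


G = [[4.0, 1.0], [1.0, 3.0]]      # illustrative positive definite matrix
H = [[2.0, 0.5], [0.5, 1.0]]      # illustrative positive semidefinite matrix
K = [[1.0, 2.0], [0.0, 3.0]]      # any nonsingular matrix

before = roots_2x2(H, G)
after = roots_2x2(congruence(K, H), congruence(K, G))
print(before, after)              # the two pairs agree up to rounding
```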

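Theorem 8.6.1. Let $x_\alpha$ be an observation from $N(\beta_1 z_\alpha^{*(1)} + \beta_2^* z_\alpha^{(2)}, \Sigma)$,
where $\sum_\alpha z_\alpha^{*(1)} z_\alpha^{(2)\prime} = 0$ and $\sum_\alpha z_\alpha^{*(1)} z_\alpha^{*(1)\prime} = A_{11\cdot 2}$. The only functions of the
sufficient statistics and $A_{11\cdot 2}$ invariant under the transformations $x_\alpha^* = x_\alpha +
\Gamma z_\alpha^{(2)}$, $z_\alpha^{**(1)} = C z_\alpha^{*(1)}$, and $x_\alpha^* = Kx_\alpha$ are the roots of (2), where $G = N\hat\Sigma$ and
$H = \hat\beta_1 A_{11\cdot 2}\hat\beta_1'$.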

The likelihood ratio criterion is a function of

(5)  $$U = \frac{|G|}{|G + H|} = \frac{|KGK'|}{|KGK' + KHK'|} = \frac{|I|}{|I + L|} = \frac{1}{\prod_{i=1}^{p}(1 + l_i)},$$

which is clearly invariant under the transformations.

Intuitively it would appear that good tests should reject the null hypothesis
when the roots in some sense are large, for if $\beta_1$ is very different from 0, then
$\hat\beta_1$ will tend to be large and so will $H$. Some other criteria that have been
suggested are (a) $\sum l_i$, (b) $\sum l_i/(1 + l_i)$, (c) $\max l_i$, and (d) $\min l_i$. In each case
we reject the null hypothesis if the criterion exceeds some specified number.
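Since every invariant criterion is a function of the roots, the four most common ones can be computed from the roots alone. The sketch below assumes the roots $l_i$ of $|H - lG| = 0$ are already available; the labels `U`, `W`, `V`, `R` are chosen here for the likelihood ratio, the sum of roots, the sum of $l_i/(1 + l_i)$, and $\max l_i/(1 + \max l_i)$, respectively.

```python
from math import prod


def criteria_from_roots(roots):
    """Compute four test criteria from the roots l_i of |H - lG| = 0."""
    largest = max(roots)
    return {
        "U": 1.0 / prod(1.0 + l for l in roots),      # |G| / |G + H|
        "W": sum(roots),                              # tr H G^{-1}
        "V": sum(l / (1.0 + l) for l in roots),       # tr H (H + G)^{-1}
        "R": largest / (1.0 + largest),               # largest root of H (H + G)^{-1}
    }


print(criteria_from_roots([1.0, 0.25, 0.0]))
```

Small values of `U` and large values of the other three count against the null hypothesis.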


8.6.2. The Lawley-Hotelling Trace Criterion

Let $K$ be the matrix such that $KGK' = I$ [$G = K^{-1}(K')^{-1}$, or $G^{-1} = K'K$]
and so (4) holds. Then the sum of the roots can be written

(6)  $$\sum_{i=1}^{p} l_i = \operatorname{tr} L = \operatorname{tr} KHK' = \operatorname{tr} HK'K = \operatorname{tr} HG^{-1}.$$

This criterion was suggested by Lawley (1938), Bartlett (1939), and Hotelling
(1947), (1951). The test procedure is to reject the hypothesis if (6) is greater
than a constant depending on $p$, $m$, and $n$.
The general distribution¹ of $\operatorname{tr} HG^{-1}$ cannot be characterized as easily as
that of $U_{p,m,n}$. In the case of $p = 2$, Hotelling (1951) obtained an explicit
expression for the distribution of $\operatorname{tr} HG^{-1} = l_1 + l_2$. A slightly different form
of this distribution is obtained from the density of the two roots $l_1$ and $l_2$ in
Chapter 13. It is

(7)  $$\Pr\{\operatorname{tr} HG^{-1} \le w\} = I_{w/(2+w)}(m - 1, n - 1) - \frac{\sqrt{\pi}\,\Gamma[\tfrac{1}{2}(m + n - 1)]}{\Gamma(\tfrac{1}{2}m)\,\Gamma(\tfrac{1}{2}n)}\,(1 + w)^{-\frac{1}{2}(n-1)}\, I_{w^2/(2+w)^2}\bigl[\tfrac{1}{2}(m - 1), \tfrac{1}{2}(n - 1)\bigr],$$

where $I_x(a, b)$ is the incomplete beta function, that is, the integral of $\beta(y; a, b)$
from 0 to $x$.

Constantine (1966) expressed the density of $\operatorname{tr} HG^{-1}$ as an infinite series
in generalized Laguerre polynomials and as an infinite series in zonal
polynomials; these series, however, converge only for $\operatorname{tr} HG^{-1} < 1$. Davis
(1968) showed that the analytic continuation of these series satisfies a system
of linear homogeneous differential equations of order $p$. Davis (1970a),
(1970b) used a solution to compute tables as given in Appendix B.

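Under the null hypothesis, $G$ is distributed as $\sum_{\alpha=1}^{n} Z_\alpha Z_\alpha'$ ($n = N - q$) and
$H$ is distributed as $\sum_{\nu=1}^{m} Y_\nu Y_\nu'$, where the $Z_\alpha$ and $Y_\nu$ are independent, each
with distribution $N(0, \Sigma)$. Since the roots are invariant under the previously
specified linear transformation, we can choose $K$ so that $K\Sigma K' = I$ and let
$G^* = KGK'$ [$= \sum (KZ_\alpha)(KZ_\alpha)'$] and $H^* = KHK'$. This is equivalent to
assuming at the outset that $\Sigma = I$.

Now

(8)  $$\operatorname*{plim}_{N\to\infty} \frac{1}{N}\, G = \operatorname*{plim}_{n\to\infty} \frac{n}{n + q} \cdot \frac{1}{n} \sum_{\alpha=1}^{n} Z_\alpha Z_\alpha' = I.$$

This result follows applying the (weak) law of large numbers to each element
of $(1/n)G$,

(9)  $$\operatorname*{plim}_{n\to\infty} \frac{1}{n} \sum_{\alpha=1}^{n} Z_{i\alpha} Z_{j\alpha} = \mathcal{E} Z_{i\alpha} Z_{j\alpha} = \delta_{ij}.$$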

Theorem 8.6.2. Let $f(H)$ be a function whose discontinuities form a set of
probability zero when $H$ is distributed as $\sum_{\nu=1}^{m} Y_\nu Y_\nu'$ with the $Y_\nu$ independent, each
with distribution $N(0, I)$. Then the limiting distribution of $f(NHG^{-1})$ is the
distribution of $f(H)$.

¹ Lawley (1938) purported to derive the exact distribution, but the result is in error.
Proof. This is a straightforward application of a general theorem [for
example, Theorem 2 of Chernoff (1956)] to the effect that if the cdf of $X_n$
converges to that of $X$ (at every continuity point of the latter) and if $g(x)$ is
a function whose discontinuities form a set of probability 0 according to
the distribution of $X$, then the cdf of $g(X_n)$ converges to that of $g(X)$. In our
case $X_n$ consists of the components of $H$ and $G$, and $X$ consists of the
components of $H$ and $I$. ■

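Corollary 8.6.1. The limiting distribution of $N \operatorname{tr} HG^{-1}$ or $n \operatorname{tr} HG^{-1}$ is the
$\chi^2$-distribution with $pq_1$ degrees of freedom.

This follows from Theorem 8.6.2, because

(10)  $$\operatorname{tr} H = \sum_{i=1}^{p} h_{ii} = \sum_{i=1}^{p} \sum_{\nu=1}^{q_1} Y_{i\nu}^2.$$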

Ito (1956), (1960) developed asymptotic formulas, and Fujikoshi (1973)
extended them. Let $w_{p,m,n}(\alpha)$ be the $\alpha$ significance point of $\operatorname{tr} HG^{-1}$; that is,

(11)  $$\Pr\{\operatorname{tr} HG^{-1} \ge w_{p,m,n}(\alpha)\} = \alpha,$$

and let $\chi^2_k(\alpha)$ be the $\alpha$-significance point of the $\chi^2$-distribution with $k$
degrees of freedom. Then

(12)  $$n w_{p,m,n}(\alpha) = \chi^2_{pm}(\alpha) + \frac{1}{2n}\left[\frac{p + m + 1}{pm + 2}\,\chi^4_{pm}(\alpha) + (p - m + 1)\,\chi^2_{pm}(\alpha)\right] + O(n^{-2}).$$

Ito also gives the term of order $n^{-2}$. See also Muirhead (1970). Davis
(1970a), (1970b) evaluated the accuracy of the approximation (12). Ito also
found

(13)  $$\Pr\{n \operatorname{tr} HG^{-1} \le z\} = G_{pm}(z) - \frac{1}{2n}\left[\frac{p + m + 1}{pm + 2}\, z^2 + (p - m + 1)\, z\right] g_{pm}(z) + O(n^{-2}),$$

where $G_k(z) = \Pr\{\chi^2_k \le z\}$ and $g_k(z) = (d/dz)G_k(z)$. Pillai (1956) suggested
another approximation to $n w_{p,m,n}(\alpha)$, and Pillai and Samson (1959) gave
moments of $\operatorname{tr} HG^{-1}$. Pillai and Young (1971) and Krishnaiah and Chang
(1972) evaluated the Laplace transform of $\operatorname{tr} HG^{-1}$ and showed how to invert
the transform. Khatri and Pillai (1966) suggest an approximate distribution
based on moments. Pillai and Young (1971) suggest approximate distributions
based on the first three moments.

Tables of the significance points are given by Grubbs (1954) for $p = 2$ and
by Davis (1970a) for $p = 3$ and 4, Davis (1970b) for $p = 5$, and Davis (1980)
for $p = 6(1)10$; approximate significance points have been given by Pillai
(1960). Davis's tables are reproduced in Table B.2.

8.6.3. The Bartlett-Nanda-Pillai Trace Criterion

Another criterion, proposed by Bartlett (1939), Nanda (1950), and Pillai
(1955), is

(14)  $$\begin{aligned} V = \sum_{i=1}^{p} \frac{l_i}{1 + l_i} &= \operatorname{tr} L(I + L)^{-1} \\ &= \operatorname{tr} KHK'(KGK' + KHK')^{-1} \\ &= \operatorname{tr} HK'[K(G + H)K']^{-1}K \\ &= \operatorname{tr} H(G + H)^{-1}, \end{aligned}$$

where as before $K$ is such that $KGK' = I$ and (4) holds. In terms of the roots
$f_i = l_i/(1 + l_i)$, $i = 1, \ldots, p$, of

(15)  $$|H - f(H + G)| = 0,$$

the criterion is $\sum_{i=1}^{p} f_i$. In principle, the cdf, density, and moments under the
null hypothesis can be found from the density of the roots (Sec. 13.2.3),

(16)  $$c \prod_{i=1}^{p} f_i^{\frac{1}{2}(m-p-1)} \prod_{i=1}^{p} (1 - f_i)^{\frac{1}{2}(n-p-1)} \prod_{i<j} (f_i - f_j),$$

where

(17)  $$c = \frac{\pi^{\frac{1}{2}p^2}\, \Gamma_p[\tfrac{1}{2}(m + n)]}{\Gamma_p(\tfrac{1}{2}n)\, \Gamma_p(\tfrac{1}{2}m)\, \Gamma_p(\tfrac{1}{2}p)},$$

for $1 > f_1 > \cdots > f_p > 0$, and 0 otherwise. If $m - p$ and $n - p$ are odd, the
density is a polynomial in $f_1, \ldots, f_p$. Then the density and cdf of the sum of
the roots are polynomials.

Many authors have written about the moments, Laplace transforms, densities,
and cdfs, using various approaches. Nanda (1950) derived the distribution
for $p = 2, 3, 4$ and $m = p + 1$. Pillai (1954), (1956), (1960) and Pillai and
Mijares (1959) calculated the first four moments of $V$ and proposed
approximating the distribution by a beta distribution based on the first four
moments. Pillai and Jayachandran (1970) show how to evaluate the moment
generating function as a weighted sum of determinants whose elements are
incomplete gamma functions; they derive exact densities for some special
cases and use them for a table of significance points. Krishnaiah and Chang
(1972) express the distributions as linear combinations of inverse Laplace
transforms of the products of certain double integrals and further develop
this technique for finding the distribution. Davis (1972b) showed that the
distribution satisfies a differential equation and showed the nature of the
solution. Khatri and Pillai (1968) obtained the (nonnull) distributions in
series forms. The characteristic function (under the null hypothesis) was
given by James (1964). Pillai and Jayachandran (1967) found the nonnull
distribution for $p = 2$ and computed power functions. For an extensive
bibliography see Krishnaiah (1978).

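We now turn to the asymptotic theory. It follows from Theorem 8.6.2 that
$nV$ or $NV$ has a limiting $\chi^2$-distribution with $pm$ degrees of freedom.

Let $v_{p,m,n}(\alpha)$ be defined by

(18)  $$\Pr\{\operatorname{tr} H(H + G)^{-1} \ge v_{p,m,n}(\alpha)\} = \alpha.$$

Then Davis (1970a), (1970b), Fujikoshi (1973), and Rothenberg (1977) have
shown that

(19)  $$n v_{p,m,n}(\alpha) = \chi^2_{pm}(\alpha) + \frac{1}{2n}\left[-\frac{p + m + 1}{pm + 2}\,\chi^4_{pm}(\alpha) + (p - m + 1)\,\chi^2_{pm}(\alpha)\right] + O(n^{-2}).$$

Since we can write (for the likelihood ratio test)

(20)  $$-n \log u_{p,m,n}(\alpha) = \chi^2_{pm}(\alpha) + \frac{1}{2n}(p - m + 1)\,\chi^2_{pm}(\alpha) + O(n^{-2}),$$

we have the comparison

(21)  $$n w_{p,m,n}(\alpha) = -n \log u_{p,m,n}(\alpha) + \frac{1}{2n} \cdot \frac{p + m + 1}{pm + 2}\,\chi^4_{pm}(\alpha) + O(n^{-2}),$$

(22)  $$n v_{p,m,n}(\alpha) = -n \log u_{p,m,n}(\alpha) - \frac{1}{2n} \cdot \frac{p + m + 1}{pm + 2}\,\chi^4_{pm}(\alpha) + O(n^{-2}).$$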
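A short numerical sketch may help to see the size of these corrections. It assumes the expansions take the form $\chi^2 + (p - m + 1)\chi^2/(2n)$ for the likelihood ratio point, with $(p + m + 1)\chi^4/\{2n(pm + 2)\}$ added for the Lawley-Hotelling point and subtracted for the Bartlett-Nanda-Pillai point, where $\chi^4$ denotes the square of the $\chi^2$ significance point. The $\chi^2$ point itself is supplied by the user (for instance from a table); the value 26.217 used below is the standard 1% point for 12 degrees of freedom.

```python
def approximate_points(chi2_point, p, m, n):
    """Second-order approximations to n*w, -n*log(u), and n*v (terms of O(n^-2) dropped)."""
    c = chi2_point
    common = (p - m + 1) * c / (2 * n)                    # shared by all three criteria
    extra = (p + m + 1) / (p * m + 2) * c * c / (2 * n)   # separates the two trace criteria
    return {
        "n*w": c + common + extra,         # Lawley-Hotelling
        "-n*log(u)": c + common,           # likelihood ratio
        "n*v": c + common - extra,         # Bartlett-Nanda-Pillai
    }


points = approximate_points(26.217, p=4, m=3, n=394)
print(points)
```

Under these assumptions the likelihood ratio point always lies midway between the two trace points, and the spread shrinks in proportion to $1/n$.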
An asymptotic expansion [Muirhead (1970), Fujikoshi (1973)] is

(23)  $$\Pr\{nV \le z\} = G_{pm}(z) + \frac{pm}{4n}\bigl[(m - p - 1)\,G_{pm}(z) + 2(p + 1)\,G_{pm+2}(z) - (p + m + 1)\,G_{pm+4}(z)\bigr] + O(n^{-2}).$$

Higher-order terms are given by Muirhead and Fujikoshi.

Tables. Pillai (1960) tabulated 1% and 5% significance points of $V$ for
$p = 2(1)8$ based on fitting Pearson curves (i.e., beta distributions with
adjusted ranges) to the first four moments. Mijares (1964) extended the tables
to $p = 50$. Table B.3 of some significance points of $(n + m)V/m =
\operatorname{tr}\,(1/m)H\{[1/(n + m)](G + H)\}^{-1}$ is from Concise Statistical Tables, and was
computed on the same basis as Pillai's. Schuurmann, Krishnaiah, and
Chattopadhyay (1975) gave exact significance points of $V$ for $p = 2(1)5$; a
more extensive table is in their technical report (ARL 73-0008). A comparison
of some values with those of Concise Statistical Tables (Appendix B)
shows a maximum difference of 3 in the third decimal place.


8.6.4. The Roy Maximum Root Criterion

Any characteristic root of $HG^{-1}$ can be used as a test criterion. Roy (1953)
proposed $l_1$, the maximum characteristic root of $HG^{-1}$, on the basis of his
union-intersection principle. The test procedure is to reject the null
hypothesis if $l_1$ is greater than a certain number, or equivalently, if
$f_1 = l_1/(1 + l_1) = R$ is greater than a number $r_{p,m,n}(\alpha)$ which satisfies

(24)  $$\Pr\{R \ge r_{p,m,n}(\alpha)\} = \alpha.$$

The density of the roots $f_1, \ldots, f_p$ for $p \le m$ under the null hypothesis is
given in (16). The cdf of $R = f_1$, $\Pr\{f_1 \le f^*\}$, can be obtained from the joint
density by integration over the range $0 \le f_p \le \cdots \le f_1 \le f^*$. If $m - p$ and
$n - p$ are both odd, the density of $f_1, \ldots, f_p$ is a polynomial; then the cdf of
$f_1$ is a polynomial in $f^*$ and the density of $f_1$ is a polynomial. The only
difficulty in carrying out the integration is keeping track of the different
terms.

Roy [(1945), (1957), Appendix 9] developed a method of integration that
results in a cdf that is a linear combination of products of univariate beta
densities and beta cdfs. The cdf of $f_1$ for $p = 2$ is

(25)  $$\Pr\{f_1 \le f\} = I_f(m - 1, n - 1) - \frac{\sqrt{\pi}\,\Gamma[\tfrac{1}{2}(m + n - 1)]}{\Gamma(\tfrac{1}{2}m)\,\Gamma(\tfrac{1}{2}n)}\, f^{\frac{1}{2}(m-1)} (1 - f)^{\frac{1}{2}(n-1)}\, I_f\bigl[\tfrac{1}{2}(m - 1), \tfrac{1}{2}(n - 1)\bigr].$$
This is derived in Section 13.5. Roy (1957), Chapter 8, gives the cdfs for
$p = 3$ and 4 also.

By Theorem 8.6.2 the limiting distribution of the largest characteristic root
of $nHG^{-1}$, $NHG^{-1}$, $nH(H + G)^{-1}$, or $NH(H + G)^{-1}$ is the distribution of
the largest characteristic root of $H$ having the distribution $W(I, m)$. The
densities of the roots of $H$ are given in Section 13.3. In principle, the
marginal density of the largest root can be obtained from the joint density by
integration, but in actual fact the integration is more difficult than that for
the density of the roots of $HG^{-1}$ or $H(H + G)^{-1}$.

The literature on this subject is too extensive to summarize here. Nanda
(1948) obtained the distribution for $p = 2, 3, 4$, and 5. Pillai (1954), (1956),
(1965), (1967) treated the distribution under the null hypothesis. Other
results were obtained by Sugiyama and Fukutomi (1966) and Sugiyama
(1967). Pillai (1967) derived an approximate distribution as a linear
combination of incomplete beta functions. Davis (1972a) showed that the density of a
single ordered root satisfies a differential equation and (1972b) derived a
recurrence relation for it. Hayakawa (1967), Khatri and Pillai (1968), Pillai
and Sugiyama (1969), and Khatri (1972) treated the noncentral case. See
Krishnaiah (1978) for more references.

Tables. Tables of the percentage points have been calculated by Nanda
(1951) and Foster and Rees (1957) for $p = 2$, Foster (1957) for $p = 3$, Foster
(1958) for $p = 4$, and Pillai (1960) for $p = 2(1)6$ on the basis of an
approximation. [See also Pillai (1956), (1960), (1964), (1965), (1967).] Heck (1960)
presented charts of the significance points for $p = 2(1)6$. Table B.4 of
significance points of $nl_1/m$ is from Concise Statistical Tables, based on the
approximation by Pillai (1967).

8.6.5. Comparison of Powers 

The four tests that have been given most consideration are those based on
Wilks's $U$, the Lawley-Hotelling $W$, the Bartlett-Nanda-Pillai $V$, and Roy's
$R$. To guide in the choice of one of these four, we would like to compare
power functions. The first three have been compared by Rothenberg on the
basis of the asymptotic expansions of their distributions in the nonnull case.

Let $\nu_1, \ldots, \nu_p$ be the roots of

(26)  $$\bigl|(\beta_1 - \beta_1^*)A_{11\cdot 2}(\beta_1 - \beta_1^*)' - \nu\Sigma\bigr| = 0.$$

The distribution of

(27)  $$\operatorname{tr}\,(\hat\beta_{1\Omega} - \beta_1^*)A_{11\cdot 2}(\hat\beta_{1\Omega} - \beta_1^*)'\Sigma^{-1}$$

is the noncentral $\chi^2$-distribution with $pm$ degrees of freedom and
noncentrality parameter $\sum_{i=1}^{p} \nu_i$. As $N \to \infty$, the quantity $(1/n)G$ or $(1/N)G$
approaches $\Sigma$ with probability one. If we let $N \to \infty$ and $A_{11\cdot 2}$ is unbounded,
the noncentrality parameter grows indefinitely and the power approaches 1.
It is more informative to consider a sequence of alternatives such that the
powers of the different tests are different. Suppose $\beta_1 = \beta_1^N$ is a sequence of
matrices such that as $N \to \infty$, $(\beta_1^N - \beta_1^*)A_{11\cdot 2}(\beta_1^N - \beta_1^*)'$ approaches a limit
and hence $\nu_1^N, \ldots, \nu_p^N$ approach some limiting values $\nu_1, \ldots, \nu_p$, respectively.
Then the limiting distribution of $N \operatorname{tr} HG^{-1}$, $n \operatorname{tr} HG^{-1}$, $N \operatorname{tr} H(H + G)^{-1}$,
and $n \operatorname{tr} H(H + G)^{-1}$ is the noncentral $\chi^2$-distribution with $pm$ degrees of
freedom and noncentrality parameter $\sum_{i=1}^{p} \nu_i$. Similarly for $-N \log U$ and
$-n \log U$.

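Rothenberg (1977) has shown under the above conditions that

(28)  $$\Pr\{U \le u_{p,m,n}(\alpha)\} = 1 - G_{pm}\Bigl[\chi^2_{pm}(\alpha) \Bigm| \sum_{i=1}^{p} \nu_i\Bigr] + \frac{1}{2n}\Bigl\{(p + m + 1)\sum_{i=1}^{p} \nu_i\, g_{pm+4}[\chi^2_{pm}(\alpha)] + \sum_{i=1}^{p} \nu_i^2\, g_{pm+6}[\chi^2_{pm}(\alpha)]\Bigr\} + o\Bigl(\frac{1}{n}\Bigr),$$

(29)  $$\Pr\{\operatorname{tr} HG^{-1} \ge w_{p,m,n}(\alpha)\} = 1 - G_{pm}\Bigl[\chi^2_{pm}(\alpha) \Bigm| \sum_{i=1}^{p} \nu_i\Bigr] + \frac{1}{2n}\Bigl\{(p + m + 1)\sum_{i=1}^{p} \nu_i\, g_{pm+4}[\chi^2_{pm}(\alpha)] + \sum_{i=1}^{p} \nu_i^2\, g_{pm+6}[\chi^2_{pm}(\alpha)] + \Bigl[\sum_{i=1}^{p} \nu_i^2 - \frac{p + m + 1}{pm + 2}\Bigl(\sum_{i=1}^{p} \nu_i\Bigr)^2\Bigr] g_{pm+8}[\chi^2_{pm}(\alpha)]\Bigr\} + o\Bigl(\frac{1}{n}\Bigr),$$

(30)  $$\Pr\{\operatorname{tr} H(H + G)^{-1} \ge v_{p,m,n}(\alpha)\} = 1 - G_{pm}\Bigl[\chi^2_{pm}(\alpha) \Bigm| \sum_{i=1}^{p} \nu_i\Bigr] + \frac{1}{2n}\Bigl\{(p + m + 1)\sum_{i=1}^{p} \nu_i\, g_{pm+4}[\chi^2_{pm}(\alpha)] + \sum_{i=1}^{p} \nu_i^2\, g_{pm+6}[\chi^2_{pm}(\alpha)] - \Bigl[\sum_{i=1}^{p} \nu_i^2 - \frac{p + m + 1}{pm + 2}\Bigl(\sum_{i=1}^{p} \nu_i\Bigr)^2\Bigr] g_{pm+8}[\chi^2_{pm}(\alpha)]\Bigr\} + o\Bigl(\frac{1}{n}\Bigr),$$

where $G_f(x \mid y)$ is the noncentral $\chi^2$-distribution with $f$ degrees of freedom
and noncentrality parameter $y$, and $g_f(x)$ is the (central) $\chi^2$-density with $f$
degrees of freedom. The leading terms are the noncentral $\chi^2$-distribution;
the power functions of the three tests agree to this order. The power
functions of the two trace tests differ from that of the likelihood ratio test by
$\pm g_{pm+8}[\chi^2_{pm}(\alpha)]/(2n)$ times

(31)  $$\sum_{i=1}^{p} \nu_i^2 - \frac{p + m + 1}{pm + 2}\Bigl(\sum_{i=1}^{p} \nu_i\Bigr)^2 = \sum_{i=1}^{p} (\nu_i - \bar\nu)^2 - \frac{p(p - 1)(p + 2)}{pm + 2}\,\bar\nu^2,$$

where $\bar\nu = \sum_{i=1}^{p} \nu_i/p$. This is positive if

(32)  $$\frac{\sigma_\nu}{\bar\nu} > \sqrt{\frac{(p - 1)(p + 2)}{pm + 2}},$$

where $\sigma_\nu^2 = \sum_{i=1}^{p} (\nu_i - \bar\nu)^2/p$ is the (population) variance of $\nu_1, \ldots, \nu_p$; the
left-hand side of (32) is the coefficient of variation. If the $\nu$'s are relatively
variable in the sense that (32) holds, the power of the Lawley-Hotelling trace
test is greater than that of the likelihood ratio test, which in turn is greater
than that of the Bartlett-Nanda-Pillai trace test (to order $1/n$); if the
inequality (32) is reversed, the ordering of power is reversed.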
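The comparison can be sketched numerically. The code assumes that the condition is $\sigma_\nu/\bar\nu > \sqrt{(p - 1)(p + 2)/(pm + 2)}$, with $\sigma_\nu^2$ the population variance of the roots; the function name and the sample roots are illustrative only, and the conclusion holds only to order $1/n$.

```python
from math import sqrt


def compare_trace_tests(nu, m):
    """Return (coefficient of variation, threshold, locally more powerful trace test).

    nu: roots nu_1, ..., nu_p of the alternative (not all zero); m: hypothesis df.
    """
    p = len(nu)
    mean = sum(nu) / p
    sd = sqrt(sum((x - mean) ** 2 for x in nu) / p)   # population standard deviation
    cv = sd / mean
    threshold = sqrt((p - 1) * (p + 2) / (p * m + 2))
    # A tie (cv == threshold) is reported as Bartlett-Nanda-Pillai; that choice is arbitrary.
    favored = "Lawley-Hotelling" if cv > threshold else "Bartlett-Nanda-Pillai"
    return cv, threshold, favored


print(compare_trace_tests([10.0, 0.0, 0.0], m=3))   # one dominant root
print(compare_trace_tests([5.0, 5.0, 5.0], m=3))    # equal roots
```

A single dominant root gives a large coefficient of variation and favors the Lawley-Hotelling trace; equal roots give a coefficient of zero and favor the Bartlett-Nanda-Pillai trace.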

The differences between the powers decrease as $n$ increases for fixed
$\nu_1, \ldots, \nu_p$. (However, this comparison is not very meaningful, because
increasing $n$ decreases $\beta_1^N - \beta_1^*$ and increases $Z'Z$.)

A number of numerical comparisons have been made. Schatzoff (1966b)
and Olson (1974) have used Monte Carlo methods; Mikhail (1965), Pillai and
Jayachandran (1967), and Lee (1971a) have used asymptotic expansions of
distributions. All of these results agree with Rothenberg's. Among these
three procedures, the Bartlett-Nanda-Pillai trace test is to be preferred if
the roots are roughly equal in the alternative, and the Lawley-Hotelling
trace is more powerful when the roots are substantially unequal. Wilks's
likelihood ratio test seems to come in second best; in a sense it is maximin.

As noted in Section 8.6.4, the Roy largest root has a limiting distribution
which is not a $\chi^2$-distribution under the null hypothesis and is not a
noncentral $\chi^2$-distribution under a sequence of alternative hypotheses. Hence
the comparison of Rothenberg cannot be extended to this case. In fact, the
distributions in the nonnull case are difficult to evaluate. However, the
Monte Carlo results of Schatzoff (1966b) and Olson (1974) are clear-cut.
The maximum root test has greatest power if the alternative is
one-dimensional, that is, if $\nu_2 = \cdots = \nu_p = 0$. On the other hand, if the alternative is not
one-dimensional, then the maximum root test is inferior.

These test procedures tend to be robust. Under the null hypothesis the
limiting distribution of $\hat\beta_1 - \beta_1^*$ suitably normalized is normal with mean 0
and covariances the same as if $X$ were normal, as long as its distribution
satisfies some condition such as bounded fourth-order moments. Then $\hat\Sigma =
(1/N)G$ converges with probability one. The limiting distribution of each
criterion suitably normalized is the same as if $X$ were normal. Olson (1974)
studied the robustness under departures from covariance homogeneity as
well as departures from normality. His conclusion was that the two trace tests
and the likelihood ratio test were rather robust, and the maximum root test
least robust. See also Pillai and Hsu (1979).

Berndt and Savin (1977) have noted that

(33)  $$\operatorname{tr} H(H + G)^{-1} \le \log U^{-1} \le \operatorname{tr} HG^{-1}.$$

(See Problem 8.19.) If the $\chi^2$ significance point is used, then a larger
criterion may lead to rejection while a smaller one may not.
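In terms of the roots the inequality is $l/(1 + l) \le \log(1 + l) \le l$ applied to each $l_i$ and summed. The sketch below uses made-up roots with $p = 4$, $m = 3$, $n = 394$ to show the three statistics falling on different sides of one $\chi^2$ point; 26.217 is the standard 1% point for 12 degrees of freedom, and the roots are hypothetical.

```python
from math import log


def scaled_statistics(roots, n):
    """Return (n*V, -n*log U, n*W) computed from the roots l_i of |H - lG| = 0."""
    v = sum(l / (1 + l) for l in roots)
    neg_log_u = sum(log(1 + l) for l in roots)
    w = sum(roots)
    return n * v, n * neg_log_u, n * w


chi2_point = 26.217                       # 1% point, 12 degrees of freedom
n_v, n_lr, n_w = scaled_statistics([0.04, 0.02, 0.005, 0.002], n=394)

for name, value in (("n V", n_v), ("-n log U", n_lr), ("n W", n_w)):
    print(name, round(value, 3), "reject" if value > chi2_point else "do not reject")
```

With these roots only the Lawley-Hotelling statistic exceeds the common significance point, which is the kind of conflict the text describes.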

8.7. TESTS OF HYPOTHESES ABOUT MATRICES OF REGRESSION 
COEFFICIENTS AND CONFIDENCE REGIONS 

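8.7.1. Testing Hypotheses

Suppose we are given a set of vector observations $x_1, \ldots, x_N$ with
accompanying fixed vectors $z_1, \ldots, z_N$, where $x_\alpha$ is an observation from $N(\beta z_\alpha, \Sigma)$. We
let $\beta = (\beta_1\ \beta_2)$ and $z_\alpha' = (z_\alpha^{(1)\prime}\ z_\alpha^{(2)\prime})$, where $\beta_1$ and $z_\alpha^{(1)\prime}$ have $q_1$ $(= q - q_2)$
columns. The null hypothesis is

(1)  $$H\colon \beta_1 = \beta_1^*,$$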
where $\beta_1^*$ is a specified matrix. Suppose the desired significance level is $\alpha$. A
test procedure is to compute

(2)  $$U = \frac{|N\hat\Sigma_\Omega|}{|N\hat\Sigma_\omega|}$$

and compare this number with $u_{p,q_1,n}(\alpha)$, the $\alpha$ significance point of the
$U_{p,q_1,n}$-distribution. For $p = 2, \ldots, 10$ and even $m$, Table 1 in Appendix B can
be used. For $m = 2, \ldots, 10$ and even $p$ the same table can be used with $m$
replaced by $p$ and $p$ replaced by $m$. ($M$ as given in the table remains
unchanged.) For $p$ and $m$ both odd, interpolation between even values of
either $p$ or $m$ will give sufficient accuracy for most purposes. For reasonably
large $n$, the asymptotic theory can be used. An equivalent procedure is to
calculate $\Pr\{U_{p,m,n} \le U\}$; if this is less than $\alpha$, the null hypothesis is rejected.
Alternatively one can use the Lawley-Hotelling trace criterion

(3)  $$W = \operatorname{tr}\,(N\hat\Sigma_\omega - N\hat\Sigma_\Omega)(N\hat\Sigma_\Omega)^{-1} = \operatorname{tr}\,(\hat\beta_{1\Omega} - \beta_1^*)A_{11\cdot 2}(\hat\beta_{1\Omega} - \beta_1^*)'(N\hat\Sigma_\Omega)^{-1},$$

the Pillai trace criterion

(4)  $$V = \operatorname{tr}\,(N\hat\Sigma_\omega - N\hat\Sigma_\Omega)(N\hat\Sigma_\omega)^{-1} = \operatorname{tr}\,(\hat\beta_{1\Omega} - \beta_1^*)A_{11\cdot 2}(\hat\beta_{1\Omega} - \beta_1^*)'(N\hat\Sigma_\omega)^{-1},$$

or the Roy maximum root criterion $R$, where $R$ is the maximum root of

(5)  $$|N\hat\Sigma_\omega - N\hat\Sigma_\Omega - rN\hat\Sigma_\omega| = \bigl|(\hat\beta_{1\Omega} - \beta_1^*)A_{11\cdot 2}(\hat\beta_{1\Omega} - \beta_1^*)' - rN\hat\Sigma_\omega\bigr| = 0.$$

These criteria can be referred to the appropriate tables in Appendix B.

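We outline an approach to computing the criterion. If we let $y_\alpha = x_\alpha -
\beta_1^* z_\alpha^{(1)}$, then $y_\alpha$ can be considered as an observation from $N(\Delta z_\alpha, \Sigma)$, where
$\Delta = (\Delta_1\ \Delta_2) = (\beta_1 - \beta_1^*\ \ \beta_2)$. Then the null hypothesis is $H\colon \Delta_1 = 0$, and

(6)  $$\sum_\alpha y_\alpha y_\alpha' = \sum_\alpha x_\alpha x_\alpha' - \beta_1^* C_1' - C_1 \beta_1^{*\prime} + \beta_1^* A_{11} \beta_1^{*\prime},$$

(7)  $$\sum_\alpha y_\alpha z_\alpha' = C - \beta_1^* (A_{11}\ A_{12}).$$

Thus the problem of testing the hypothesis $\beta_1 = \beta_1^*$ is equivalent to testing
the hypothesis $\Delta_1 = 0$, where $\mathcal{E}y_\alpha = \Delta z_\alpha$. Hence let us suppose the problem
is testing the hypothesis $\beta_1 = 0$. Then $N\hat\Sigma_\omega = \sum_\alpha x_\alpha x_\alpha' - \hat\beta_{2\omega} A_{22} \hat\beta_{2\omega}'$ and
$N\hat\Sigma_\Omega = \sum_\alpha x_\alpha x_\alpha' - \hat\beta_\Omega A \hat\beta_\Omega'$. We have discussed in Section 8.2.2 the
computation of $\hat\beta_\Omega A \hat\beta_\Omega'$ and hence $N\hat\Sigma_\Omega$. Then $\hat\beta_{2\omega} A_{22} \hat\beta_{2\omega}'$ can be computed in a
similar manner. If the method is laid out as

(8)  $$\begin{pmatrix} A_{22} & A_{21} \\ A_{12} & A_{11} \end{pmatrix} \begin{pmatrix} \hat\beta_{2\Omega}' \\ \hat\beta_{1\Omega}' \end{pmatrix} = \begin{pmatrix} C_2' \\ C_1' \end{pmatrix},$$

the first $q_2$ rows and columns of $A^*$ and of $A^{**}$ are the same as the result of
applying the forward solution to the left-hand side of

(9)  $$A_{22} \hat\beta_{2\omega}' = C_2',$$

and the first $q_2$ rows of $C^*$ and $C^{**}$ are the same as the result of applying
the forward solution to the right-hand side of (9). Thus $\hat\beta_{2\omega} A_{22} \hat\beta_{2\omega}' = C_2^{*\prime} C_2^{**}$,
where $C^{*\prime} = (C_2^{*\prime}\ C_1^{*\prime})$ and $C^{**\prime} = (C_2^{**\prime}\ C_1^{**\prime})$.

The method implies a method for computing a determinant. In Section
A.5 of the Appendix it is shown that the result of the forward solution is
$FA = A^*$. Thus $|F| \cdot |A| = |A^*|$. Since the determinant of a triangular matrix
is the product of its diagonal elements, $|F| = 1$ and $|A| = |A^*| = \prod_i a_{ii}^*$.
This result holds for any positive definite matrix in place of $A$ (with a suitable
modification of $F$) and hence can be used to compute $|N\hat\Sigma_\Omega|$ and $|N\hat\Sigma_\omega|$.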
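The determinant computation can be sketched as plain Gaussian forward elimination without row interchanges, which is what the forward solution amounts to for a positive definite matrix: the matrix is reduced to upper triangular form by a unit lower triangular $F$, and the determinant is the product of the resulting diagonal elements. The function name and the test matrix are illustrative.

```python
def det_by_forward_solution(matrix):
    """Determinant of a positive definite matrix as the product of elimination pivots."""
    a = [row[:] for row in matrix]         # work on a copy
    size = len(a)
    det = 1.0
    for k in range(size):
        pivot = a[k][k]
        if pivot <= 0:
            raise ValueError("matrix is not positive definite")
        det *= pivot                       # diagonal element of the triangular A*
        for i in range(k + 1, size):
            factor = a[i][k] / pivot
            for j in range(k, size):
                a[i][j] -= factor * a[k][j]
    return det


print(det_by_forward_solution([[2.0, 1.0, 0.0],
                               [1.0, 2.0, 1.0],
                               [0.0, 1.0, 2.0]]))
```

Applied to $N\hat\Sigma_\Omega$ and $N\hat\Sigma_\omega$ in turn, the ratio of the two results gives the criterion $U$.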


8.7.2. Confidence Regions Based on U 

We have considered tests of hypotheses $\beta_1 = \beta_1^*$, where $\beta_1^*$ is specified. In
the usual way we can deduce from the family of tests a confidence region for
$\beta_1$. From the theory given before, we know that the probability is $1 - \alpha$ of
drawing a sample so that

(10)  $$\frac{|N\hat\Sigma_\Omega|}{\bigl|N\hat\Sigma_\Omega + (\hat\beta_{1\Omega} - \beta_1)A_{11\cdot 2}(\hat\beta_{1\Omega} - \beta_1)'\bigr|} \ge u_{p,m,n}(\alpha).$$

Thus if we make the confidence-region statement that $\beta_1$ satisfies

(11)  $$\frac{|N\hat\Sigma_\Omega|}{\bigl|N\hat\Sigma_\Omega + (\hat\beta_{1\Omega} - \beta_1^*)A_{11\cdot 2}(\hat\beta_{1\Omega} - \beta_1^*)'\bigr|} \ge u_{p,m,n}(\alpha),$$

where (11) is interpreted as an inequality on $\beta_1 = \beta_1^*$, then the probability is
$1 - \alpha$ of drawing a sample such that the statement is true.

Theorem 8.7.1. The region (11) in the $\beta_1^*$-space is a confidence region for
$\beta_1$ with confidence coefficient $1 - \alpha$.

Usually the set of $\beta_1^*$ satisfying (11) is difficult to visualize. However, the
inequality can be used to determine whether trial matrices are included in
the region.


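8.7.3. Simultaneous Confidence Intervals Based on the
Lawley-Hotelling Trace

Each test procedure implies a set of confidence regions. The Lawley-Hotelling
trace criterion can be used to develop simultaneous confidence intervals
for linear combinations of elements of $\beta_1$. A confidence region with
confidence coefficient $1 - \alpha$ is

(12)  $$\operatorname{tr}\,(\hat\beta_{1\Omega} - \beta_1)A_{11\cdot 2}(\hat\beta_{1\Omega} - \beta_1)'(N\hat\Sigma_\Omega)^{-1} \le w_{p,m,n}(\alpha).$$

To derive the confidence bounds we generalize Lemma 5.3.2.

Lemma 8.7.1. For positive definite matrices $A$ and $G$,

(13)  $$|\operatorname{tr} \Phi'Y| \le \sqrt{\operatorname{tr} A^{-1}\Phi'G\Phi}\,\sqrt{\operatorname{tr} AY'G^{-1}Y}.$$

Proof. Let $b = \operatorname{tr} \Phi'Y / \operatorname{tr} A^{-1}\Phi'G\Phi$. Then

(14)  $$\begin{aligned} 0 &\le \operatorname{tr} A(Y - bG\Phi A^{-1})'G^{-1}(Y - bG\Phi A^{-1}) \\ &= \operatorname{tr} AY'G^{-1}Y - b \operatorname{tr} \Phi'Y - b \operatorname{tr} Y'\Phi + b^2 \operatorname{tr} \Phi A^{-1}\Phi'G \\ &= \operatorname{tr} AY'G^{-1}Y - \frac{(\operatorname{tr} \Phi'Y)^2}{\operatorname{tr} A^{-1}\Phi'G\Phi}, \end{aligned}$$

which yields (13). ■

Now (12) and (13) imply that

(15)  $$\begin{aligned} |\operatorname{tr} \Phi'\hat\beta_{1\Omega} - \operatorname{tr} \Phi'\beta_1| &= |\operatorname{tr} \Phi'(\hat\beta_{1\Omega} - \beta_1)| \\ &\le \sqrt{\operatorname{tr} A_{11\cdot 2}^{-1}\Phi'N\hat\Sigma_\Omega\Phi}\,\sqrt{\operatorname{tr} A_{11\cdot 2}(\hat\beta_{1\Omega} - \beta_1)'(N\hat\Sigma_\Omega)^{-1}(\hat\beta_{1\Omega} - \beta_1)} \\ &\le \sqrt{\operatorname{tr} A_{11\cdot 2}^{-1}\Phi'N\hat\Sigma_\Omega\Phi}\,\sqrt{w_{p,m,n}(\alpha)} \end{aligned}$$

holds for all $p \times m$ matrices $\Phi$. We assert that

(16)  $$\operatorname{tr} \Phi'\hat\beta_{1\Omega} - \sqrt{N \operatorname{tr} A_{11\cdot 2}^{-1}\Phi'\hat\Sigma_\Omega\Phi}\,\sqrt{w_{p,m,n}(\alpha)} \le \operatorname{tr} \Phi'\beta_1 \le \operatorname{tr} \Phi'\hat\beta_{1\Omega} + \sqrt{N \operatorname{tr} A_{11\cdot 2}^{-1}\Phi'\hat\Sigma_\Omega\Phi}\,\sqrt{w_{p,m,n}(\alpha)}$$

holds for all $\Phi$ with confidence $1 - \alpha$.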
The confidence region (12) can be explored by use of (16) for various $\Phi$. If
$\phi_{ik} = 1$ for some pair $(I, K)$ and 0 for other elements, then (16) gives an
interval for $\beta_{IK}$. If $\phi_{ik} = 1$ for a pair $(I, K)$, $-1$ for $(I, L)$, and 0 otherwise,
the interval pertains to $\beta_{IK} - \beta_{IL}$, the difference of coefficients of two
independent variables. If $\phi_{ik} = 1$ for a pair $(I, K)$, $-1$ for $(J, K)$, and 0
otherwise, one obtains an interval for $\beta_{IK} - \beta_{JK}$, the difference of
coefficients for two dependent variables.


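8.7.4. Simultaneous Confidence Intervals Based on the Roy Maximum
Root Criterion

A confidence region with confidence $1 - \alpha$ based on the maximum root
criterion is

(17)  $$\operatorname{ch}_1\bigl[(\hat\beta_{1\Omega} - \beta_1)A_{11\cdot 2}(\hat\beta_{1\Omega} - \beta_1)'(N\hat\Sigma_\Omega)^{-1}\bigr] \le r_{p,m,n}(\alpha),$$

where $\operatorname{ch}_1(C)$ denotes the largest characteristic root of $C$. We can derive
simultaneous confidence bounds from (17). From Lemma 5.3.2, we find for
any vectors $a$ and $b$

(18)  $$\begin{aligned} [a'(\hat\beta_{1\Omega} - \beta_1)b]^2 &= \{[(\hat\beta_{1\Omega} - \beta_1)'a]'b\}^2 \\ &\le [(\hat\beta_{1\Omega} - \beta_1)'a]'A_{11\cdot 2}[(\hat\beta_{1\Omega} - \beta_1)'a] \cdot b'A_{11\cdot 2}^{-1}b \\ &= \frac{a'(\hat\beta_{1\Omega} - \beta_1)A_{11\cdot 2}(\hat\beta_{1\Omega} - \beta_1)'a}{a'Ga} \cdot a'Ga \cdot b'A_{11\cdot 2}^{-1}b \\ &\le \operatorname{ch}_1\bigl[(\hat\beta_{1\Omega} - \beta_1)A_{11\cdot 2}(\hat\beta_{1\Omega} - \beta_1)'G^{-1}\bigr] \cdot a'Ga \cdot b'A_{11\cdot 2}^{-1}b \\ &\le r_{p,m,n}(\alpha) \cdot a'Ga \cdot b'A_{11\cdot 2}^{-1}b \end{aligned}$$

with probability $1 - \alpha$; the second inequality follows from Theorem A.2.4 of
the Appendix. Then a set of confidence intervals on all linear combinations
$a'\beta_1 b$ holding with confidence $1 - \alpha$ is

(19)  $$a'\hat\beta_{1\Omega}b - \sqrt{r_{p,m,n}(\alpha)\, a'Ga \cdot b'A_{11\cdot 2}^{-1}b} \le a'\beta_1 b \le a'\hat\beta_{1\Omega}b + \sqrt{r_{p,m,n}(\alpha)\, a'Ga \cdot b'A_{11\cdot 2}^{-1}b}.$$

The linear combinations are $a'\beta_1 b = \sum_{i=1}^{p}\sum_{h=1}^{m} a_i \beta_{ih} b_h$. If $a_1 = 1$, $a_i = 0$,
$i \ne 1$, and $b_1 = 1$, $b_h = 0$, $h \ne 1$, the linear combination is simply $\beta_{11}$. If
$a_1 = 1$, $a_i = 0$, $i \ne 1$, and $b_1 = 1$, $b_2 = -1$, $b_h = 0$, $h \ne 1, 2$, the linear
combination is $\beta_{11} - \beta_{12}$.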
We can compare these intervals with (16) for $\Phi = ab'$, which is of rank 1.
The term subtracted from and added to $\operatorname{tr} \Phi'\hat\beta_{1\Omega} = a'\hat\beta_{1\Omega}b$ is the square root
of

(20)  $$w_{p,m,n}(\alpha) \cdot \operatorname{tr} A_{11\cdot 2}^{-1}ba'Gab' = w_{p,m,n}(\alpha) \cdot a'Ga \cdot b'A_{11\cdot 2}^{-1}b.$$

This is greater than the term subtracted and added to $a'\hat\beta_{1\Omega}b$ in (19) because
$w_{p,m,n}(\alpha)$, pertaining to the sum of the roots, is greater than $r_{p,m,n}(\alpha)$,
relating to one root. The bounds (16) hold for all $p \times m$ matrices $\Phi$, while
(19) holds only for matrices $ab'$ of rank 1.

Mudholkar (1966) gives a very general method of constructing simultaneous
confidence intervals based on symmetric gauge functions. Gabriel (1969)
relates confidence bounds to simultaneous test procedures. Wijsman (1979)
showed that under certain conditions the confidence sets based on the
maximum root are smallest. [See also Wijsman (1980).]

8.8. TESTING EQUALITY OF MEANS OF SEVERAL NORMAL 
DISTRIBUTIONS WITH COMMON COVARIANCE MATRIX 

In univariate analysis it is well known that many hypotheses can be put in the 
form of hypotheses concerning regression coefficients. The same is true for 
the corresponding multivariate cases. As an example we consider testing the 
hypothesis that the means of, say, q normal distributions with a common 
covariance matrix are equal. 

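Let $y_\alpha^{(i)}$ be an observation from $N(\mu^{(i)}, \Sigma)$, $\alpha = 1, \ldots, N_i$, $i = 1, \ldots, q$. The
null hypothesis is

(1)  $$H\colon \mu^{(1)} = \cdots = \mu^{(q)}.$$

To put the problem in the form considered earlier in this chapter, let

(2)  $$X = (x_1\ x_2 \cdots x_{N_1}\ x_{N_1+1} \cdots x_N) = \bigl(y_1^{(1)}\ y_2^{(1)} \cdots y_{N_1}^{(1)}\ y_1^{(2)} \cdots y_{N_q}^{(q)}\bigr)$$

with $N = N_1 + \cdots + N_q$. Let

(3)  $$Z = (z_1\ z_2 \cdots z_{N_1}\ z_{N_1+1} \cdots z_N) = (z_{i\alpha});$$

that is, $z_{i\alpha} = 1$ if $N_1 + \cdots + N_{i-1} < \alpha \le N_1 + \cdots + N_i$, and $z_{i\alpha} = 0$ otherwise,
for $i = 1, \ldots, q - 1$, and $z_{q\alpha} = 1$ (all $\alpha$). Let $\beta = (\beta_1\ \beta_2)$, where

(4)  $$\beta_1 = \bigl(\mu^{(1)} - \mu^{(q)}, \ldots, \mu^{(q-1)} - \mu^{(q)}\bigr), \qquad \beta_2 = \mu^{(q)}.$$

Then $x_\alpha$ is an observation from $N(\beta z_\alpha, \Sigma)$, and the null hypothesis is $\beta_1 = 0$.
Thus we can use the above theory for finding the criterion for testing the
hypothesis.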
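We have

(5)  $$A = \sum_{\alpha=1}^{N} z_\alpha z_\alpha' = \begin{pmatrix} N_1 & 0 & \cdots & 0 & N_1 \\ 0 & N_2 & \cdots & 0 & N_2 \\ \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & \cdots & N_{q-1} & N_{q-1} \\ N_1 & N_2 & \cdots & N_{q-1} & N \end{pmatrix},$$

(6)  $$C = \sum_{\alpha=1}^{N} x_\alpha z_\alpha' = \Bigl(\sum_\alpha y_\alpha^{(1)}\ \ \sum_\alpha y_\alpha^{(2)}\ \cdots\ \sum_\alpha y_\alpha^{(q-1)}\ \ \sum_{i,\alpha} y_\alpha^{(i)}\Bigr).$$

Here $A_{22} = N$ and $C_2 = \sum_{i,\alpha} y_\alpha^{(i)}$. Thus $\hat\beta_{2\omega} = \sum_{i,\alpha} y_\alpha^{(i)}(1/N) = \bar y$, say, and

(7)  $$\begin{aligned} N\hat\Sigma_\omega &= \sum_\alpha x_\alpha x_\alpha' - \bar y N \bar y' \\ &= \sum_{i,\alpha} y_\alpha^{(i)} y_\alpha^{(i)\prime} - N \bar y \bar y' \\ &= \sum_{i,\alpha} \bigl(y_\alpha^{(i)} - \bar y\bigr)\bigl(y_\alpha^{(i)} - \bar y\bigr)'. \end{aligned}$$

For $\hat\Sigma_\Omega$, we use the formula $N\hat\Sigma_\Omega = \sum_\alpha x_\alpha x_\alpha' - \hat\beta_\Omega A \hat\beta_\Omega' = \sum_\alpha x_\alpha x_\alpha' - CA^{-1}C'$.
Let

(8)  $$D = \begin{pmatrix} 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \\ -1 & -1 & \cdots & -1 & 1 \end{pmatrix};$$

then

(9)  $$D^{-1} = \begin{pmatrix} 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \\ 1 & 1 & \cdots & 1 & 1 \end{pmatrix}.$$

Thus

(10)  $$\begin{aligned} CA^{-1}C' &= CD'D'^{-1}A^{-1}D^{-1}DC' \\ &= CD'(DAD')^{-1}DC' \\ &= \Bigl(\sum_\alpha y_\alpha^{(1)} \cdots \sum_\alpha y_\alpha^{(q)}\Bigr) \begin{pmatrix} N_1 & & 0 \\ & \ddots & \\ 0 & & N_q \end{pmatrix}^{-1} \begin{pmatrix} \sum_\alpha y_\alpha^{(1)\prime} \\ \vdots \\ \sum_\alpha y_\alpha^{(q)\prime} \end{pmatrix} \\ &= \sum_i N_i \bar y^{(i)} \bar y^{(i)\prime}, \end{aligned}$$

where $\bar y^{(i)} = (1/N_i)\sum_\alpha y_\alpha^{(i)}$. Thus

(11)  $$\begin{aligned} N\hat\Sigma_\Omega &= \sum_{i,\alpha} y_\alpha^{(i)} y_\alpha^{(i)\prime} - \sum_i N_i \bar y^{(i)} \bar y^{(i)\prime} \\ &= \sum_{i,\alpha} \bigl(y_\alpha^{(i)} - \bar y^{(i)}\bigr)\bigl(y_\alpha^{(i)} - \bar y^{(i)}\bigr)'. \end{aligned}$$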

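It will be seen that $\hat\Sigma_\omega$ is the estimator of $\Sigma$ when $\mu^{(1)} = \cdots = \mu^{(q)}$ and $\hat\Sigma_\Omega$
is the weighted average of the estimators of $\Sigma$ based on the separate
samples.

When the null hypothesis is true, $|N\hat\Sigma_\Omega| / |N\hat\Sigma_\omega|$ is distributed as $U_{p,q-1,n}$,
where $n = N - q$. Therefore, the rejection region at the $\alpha$ significance level is

(12)  $$\frac{|N\hat\Sigma_\Omega|}{|N\hat\Sigma_\omega|} < u_{p,q-1,n}(\alpha).$$

The left-hand side of (12) is (11) of Section 8.3, and

(13)  $$\begin{aligned} N\hat\Sigma_\omega - N\hat\Sigma_\Omega &= \Bigl(\sum_{i,\alpha} y_\alpha^{(i)} y_\alpha^{(i)\prime} - N \bar y \bar y'\Bigr) - \Bigl(\sum_{i,\alpha} y_\alpha^{(i)} y_\alpha^{(i)\prime} - \sum_i N_i \bar y^{(i)} \bar y^{(i)\prime}\Bigr) \\ &= \sum_i N_i \bigl(\bar y^{(i)} - \bar y\bigr)\bigl(\bar y^{(i)} - \bar y\bigr)' = H, \end{aligned}$$

as implied by (4) and (5) of Section 8.4. Here $H$ has the distribution
$W(\Sigma, q - 1)$. It will be seen that when $p = 1$, this test reduces to the usual
$F$-test

(14)  $$\frac{\sum_i N_i \bigl(\bar y^{(i)} - \bar y\bigr)^2}{\sum_{i,\alpha} \bigl(y_\alpha^{(i)} - \bar y^{(i)}\bigr)^2} \cdot \frac{n}{q - 1} > F_{q-1,n}(\alpha).$$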

We give an example of the analysis. The data are taken from Barnard's
study of Egyptian skulls (1935). The 4 $(= q)$ populations are Late Predynastic
$(i = 1)$, Sixth to Twelfth $(i = 2)$, Twelfth to Thirteenth $(i = 3)$, and Ptolemaic
Dynasties $(i = 4)$. The 4 $(= p)$ measurements (i.e., components of $y_\alpha^{(i)}$) are
maximum breadth, basialveolar length, nasal height, and basibregmatic height.
The numbers of observations are $N_1 = 91$, $N_2 = 162$, $N_3 = 70$, $N_4 = 75$. The
data are summarized as

(15)  $$\bigl(\bar y^{(1)}\ \bar y^{(2)}\ \bar y^{(3)}\ \bar y^{(4)}\bigr) = \begin{pmatrix} 133.582418 & 134.265432 & 134.371429 & 135.306667 \\ 98.307692 & 96.462963 & 95.857143 & 95.040000 \\ 50.835165 & 51.148148 & 50.100000 & 52.093333 \\ 133.000000 & 134.882716 & 133.642857 & 131.466667 \end{pmatrix},$$

(16)  $$N\hat\Sigma_\Omega = \begin{pmatrix} 9661.997470 & 445.573301 & 1130.623900 & 2148.584210 \\ 445.573301 & 9073.115027 & 1239.211990 & 2255.812722 \\ 1130.623900 & 1239.211990 & 3938.320351 & 1271.054662 \\ 2148.584210 & 2255.812722 & 1271.054662 & 8741.508829 \end{pmatrix}.$$

From these data we find

(17)  $$N\hat\Sigma_\omega = \begin{pmatrix} 9785.178098 & 214.197666 & 1217.929248 & 2019.820216 \\ 214.197666 & 9559.460890 & 1131.716372 & 2381.126040 \\ 1217.929248 & 1131.716372 & 4088.731856 & 1133.473898 \\ 2019.820216 & 2381.126040 & 1133.473898 & 9382.242820 \end{pmatrix}.$$
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We shall use the likelihood ratio test. The ratio of determinants is 


(18) 


=|/VS n | = 2.4269054 X 10 5 
=INij " 2.9544475 X l0 5 


-0i214344. 


Here N = 398, n = 394, p = 4, and q = 4. Thus k = 393. Since n is very large,
we may assume $-k \log U_{4,3,394}$ is distributed as $\chi^2$ with 12 degrees of
freedom (when the null hypothesis is true). Here $-k \log U = 77.30$. Since the
1% point of the $\chi^2_{12}$-distribution is 26.2, the hypothesis of $\mu^{(1)} = \mu^{(2)} = \mu^{(3)} = \mu^{(4)}$
is rejected.†
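
The arithmetic of this example can be checked with a few lines of Python. The sketch below is not part of the original computations: it takes the two determinants of (18) as given (only the mantissas, since the common power of ten cancels in the ratio), and it assumes the multiplier k = n - (p - m + 1)/2 with m = q - 1, which reproduces the value k = 393 stated above. All variable names are illustrative.

```python
import math

# Mantissas of the determinants in (18); the common power of ten cancels.
det_Omega = 2.4269054   # |N * Sigma_hat_Omega|
det_omega = 2.9544475   # |N * Sigma_hat_omega|

N, p, q = 398, 4, 4
n = N - q               # degrees of freedom for error
m = q - 1               # degrees of freedom for the hypothesis

U = det_Omega / det_omega

# Multiplier for the chi-square approximation (assumed form; gives 393 here).
k = n - (p - m + 1) / 2

statistic = -k * math.log(U)
df = p * m              # degrees of freedom of the chi-square approximation

print(f"U = {U:.4f}, k = {k:g}, -k log U = {statistic:.2f}, df = {df}")
```

The printed statistic can then be compared with the 1% point 26.2 quoted in the text.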


8.9. MULTIVARIATE ANALYSIS OF VARIANCE 

The univariate analysis of variance has a direct generalization for vector
variables leading to an analysis of vector sums of squares (i.e., sums such as
$\sum x_{\alpha} x_{\alpha}'$). In fact, in the preceding section this generalization was considered
for an analysis of variance problem involving a single classification.

As another example consider a two-way layout. Suppose that we are
interested in the question whether the column effects are zero. We shall
review the analysis for a scalar variable and then show the analysis for a
vector variable. Let $Y_{ij}$, $i = 1, \ldots, r$, $j = 1, \ldots, c$, be a set of $rc$ random
variables. We assume that

(1)    $\mathscr{E} Y_{ij} = \mu + \lambda_i + \nu_j, \qquad i = 1, \ldots, r, \quad j = 1, \ldots, c,$

with the restrictions

(2)    $\sum_{i=1}^{r} \lambda_i = \sum_{j=1}^{c} \nu_j = 0,$

that the variance of $Y_{ij}$ is $\sigma^2$, and that the $Y_{ij}$ are independently normally
distributed. To test that column effects are zero is to test that

(3)    $\nu_j = 0, \qquad j = 1, \ldots, c.$

† The above computations were given by Bartlett (1947).

This problem can be treated as a problem of regression by the introduction
of dummy fixed variates. Let

(4)    $z_{00,ij} = 1,$
       $z_{k0,ij} = 1, \quad k = i,$
       $\qquad\; = 0, \quad k \neq i,$
       $z_{0k,ij} = 1, \quad k = j,$
       $\qquad\; = 0, \quad k \neq j.$

Then (1) can be written

(5)    $\mathscr{E} Y_{ij} = \mu z_{00,ij} + \sum_{k=1}^{r} \lambda_k z_{k0,ij} + \sum_{k=1}^{c} \nu_k z_{0k,ij}.$


The hypothesis is that the coefficients of $z_{0k,ij}$ are zero. Since the matrix of
fixed variates here,

(6)    $\begin{pmatrix} z_{00,11} & \cdots & z_{00,rc} \\ z_{10,11} & \cdots & z_{10,rc} \\ z_{20,11} & \cdots & z_{20,rc} \\ \vdots & & \vdots \\ z_{0c,11} & \cdots & z_{0c,rc} \end{pmatrix},$

is singular (for example, row 00 is the sum of rows 10, 20, ..., r0), one must
elaborate the regression theory. When one does, one finds that the test
criterion indicated by the regression theory is the usual F-test of analysis of
variance.

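Let

(7)    $\bar{Y}_{\cdot\cdot} = \frac{1}{rc} \sum_{i,j} Y_{ij}, \qquad \bar{Y}_{i\cdot} = \frac{1}{c} \sum_{j} Y_{ij}, \qquad \bar{Y}_{\cdot j} = \frac{1}{r} \sum_{i} Y_{ij},$

and let

(8)    $a = \sum_{i,j} (Y_{ij} - \bar{Y}_{i\cdot} - \bar{Y}_{\cdot j} + \bar{Y}_{\cdot\cdot})^2,$
       $b = r \sum_{j} (\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot})^2.$

Then the F-statistic is given by

(9)    $F = \frac{b}{a} \cdot \frac{(c-1)(r-1)}{c-1}.$

Under the null hypothesis, this has the F-distribution with $c - 1$ and $(r - 1)(c - 1)$
degrees of freedom. The likelihood ratio criterion for the hypothesis
is the $rc/2$ power of

(10)   $\frac{a}{a+b} = \frac{1}{1 + \{(c-1)/[(r-1)(c-1)]\} F}.$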
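
The relation between the F-statistic (9) and the ratio in (10) is an algebraic identity, which a small numerical sketch makes concrete. The 3 x 4 table below is hypothetical (it does not come from the text), and the variable names are illustrative only.

```python
# Hypothetical two-way layout with r = 3 rows and c = 4 columns.
Y = [
    [12.0, 15.0, 11.0, 14.0],
    [10.0, 14.0,  9.0, 15.0],
    [13.0, 17.0, 12.0, 16.0],
]
r, c = len(Y), len(Y[0])

grand = sum(map(sum, Y)) / (r * c)
row_mean = [sum(row) / c for row in Y]
col_mean = [sum(Y[i][j] for i in range(r)) / r for j in range(c)]

# Error sum of squares a and column sum of squares b, as in (8).
a = sum((Y[i][j] - row_mean[i] - col_mean[j] + grand) ** 2
        for i in range(r) for j in range(c))
b = r * sum((col_mean[j] - grand) ** 2 for j in range(c))

# F-statistic of (9) and the two sides of (10).
F = (b / (c - 1)) / (a / ((r - 1) * (c - 1)))
lhs = a / (a + b)
rhs = 1.0 / (1.0 + ((c - 1) / ((r - 1) * (c - 1))) * F)

print(f"F = {F:.3f}, a/(a+b) = {lhs:.6f}, right-hand side of (10) = {rhs:.6f}")
```

The two sides of (10) agree for any table with a > 0, so small values of a/(a + b) correspond exactly to large values of F.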


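Now let us turn to the multivariate analysis of variance. We have a set of
p-dimensional random vectors $Y_{ij}$, $i = 1, \ldots, r$, $j = 1, \ldots, c$, with expected
values (1), where $\mu$, the $\lambda$'s, and the $\nu$'s are vectors, and with covariance
matrix $\Sigma$, and they are independently normally distributed. Then the same
algebra may be used to reduce this problem to the regression problem. We
define $\bar{Y}_{\cdot\cdot}$, $\bar{Y}_{i\cdot}$, $\bar{Y}_{\cdot j}$ by (7) and

(11)   $A = \sum_{i,j} (Y_{ij} - \bar{Y}_{i\cdot} - \bar{Y}_{\cdot j} + \bar{Y}_{\cdot\cdot})(Y_{ij} - \bar{Y}_{i\cdot} - \bar{Y}_{\cdot j} + \bar{Y}_{\cdot\cdot})'$
       $\quad = \sum_{i,j} Y_{ij} Y_{ij}' - c \sum_{i} \bar{Y}_{i\cdot} \bar{Y}_{i\cdot}' - r \sum_{j} \bar{Y}_{\cdot j} \bar{Y}_{\cdot j}' + rc\, \bar{Y}_{\cdot\cdot} \bar{Y}_{\cdot\cdot}',$

       $B = r \sum_{j} (\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot})(\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot})'$
       $\quad = r \sum_{j} \bar{Y}_{\cdot j} \bar{Y}_{\cdot j}' - rc\, \bar{Y}_{\cdot\cdot} \bar{Y}_{\cdot\cdot}'.$

Table 8.1

                         Varieties
Location      M      S      V      T      P     Sums

UF           81    105    120    110     98      514
             81     82     80     87     84      414

W           147    142    151    192    146      778
            100    116    112    148    108      584

M            82     77     78    131     90      458
            103    105    117    140    130      595

C           120    121    124    141    125      631
             99     62     96    126     76      459

GR           99     89     69     89    104      450
             66     50     97     62     80      355

D            87     77     79    102     96      441
             68     67     67     92     94      388

Sums        616    611    621    765    659     3272
            517    482    569    655    572     2795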


A statistic analogous to (10) is

(12)   $\frac{|A|}{|A + B|}.$

Under the null hypothesis, this has the distribution of $U$ for $p$, $n = (r - 1)(c - 1)$,
and $q_1 = c - 1$ given in Section 8.4. In order for $A$ to be nonsingular
(with probability 1), we must require $p \le (r - 1)(c - 1)$.

As an example we use data first published by Immer, Hayes, and Powers
(1934), and later used by Fisher (1947a), by Yates and Cochran (1938), and by
Tukey (1949). The first component of the observation vector is the barley
yield in a given year; the second component is the same measurement made
the following year. Column indices run over the varieties of barley, and row
indices over the locations. The data are given in Table 8.1 [e.g., $\binom{81}{81}$ in the
upper left-hand corner indicates a yield of 81 in each year of variety M in
location UF]. The numbers along the borders are sums.

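We consider the square of $\binom{147}{100}$ to be

       $\begin{pmatrix} 147 \\ 100 \end{pmatrix} (147 \;\; 100) = \begin{pmatrix} 21{,}609 & 14{,}700 \\ 14{,}700 & 10{,}000 \end{pmatrix}.$

Then

(13)   $\sum_{i,j} Y_{ij} Y_{ij}' = \begin{pmatrix} 380{,}944 & 315{,}381 \\ 315{,}381 & 277{,}625 \end{pmatrix},$

(14)   $\sum_{j} (6\bar{Y}_{\cdot j})(6\bar{Y}_{\cdot j})' = \begin{pmatrix} 2{,}157{,}924 & 1{,}844{,}346 \\ 1{,}844{,}346 & 1{,}579{,}583 \end{pmatrix},$

(15)   $\sum_{i} (5\bar{Y}_{i\cdot})(5\bar{Y}_{i\cdot})' = \begin{pmatrix} 1{,}874{,}386 & 1{,}560{,}145 \\ 1{,}560{,}145 & 1{,}353{,}727 \end{pmatrix},$

(16)   $(30\bar{Y}_{\cdot\cdot})(30\bar{Y}_{\cdot\cdot})' = \begin{pmatrix} 10{,}705{,}984 & 9{,}145{,}240 \\ 9{,}145{,}240 & 7{,}812{,}025 \end{pmatrix}.$

Then the error sum of squares is

(17)   $A = \begin{pmatrix} 3279 & 802 \\ 802 & 4017 \end{pmatrix},$

the row sum of squares is

(18)   $\begin{pmatrix} 18{,}011 & 7{,}188 \\ 7{,}188 & 10{,}345 \end{pmatrix},$

and the column sum of squares is

(19)   $B = \begin{pmatrix} 2788 & 2550 \\ 2550 & 2863 \end{pmatrix}.$

The test criterion is

(20)   $\frac{|A|}{|A + B|} = \frac{\begin{vmatrix} 3279 & 802 \\ 802 & 4017 \end{vmatrix}}{\begin{vmatrix} 6067 & 3352 \\ 3352 & 6880 \end{vmatrix}} = 0.4107.$

This result is to be compared with the significance point for $U_{2,4,20}$. Using the
result of Section 8.4, we see that

       $\frac{1 - \sqrt{0.4107}}{\sqrt{0.4107}} \cdot \frac{19}{4} = 2.66$

is to be compared with the significance point of $F_{8,38}$. This is significant at
the 5% level. Our data show that there are differences between varieties.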
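
The barley computation can be retraced directly from Table 8.1. The following Python sketch is an illustration rather than the author's procedure: the dictionary layout, the helper functions, and the storage of a symmetric 2 x 2 matrix as a triple (s11, s12, s22) are assumptions made for brevity. It forms the error matrix A and the column matrix B from the sums of products, then the criterion |A|/|A + B| and the F-statistic used for p = 2.

```python
import math

# Table 8.1: for each location, five (year 1, year 2) yield pairs in the
# variety order M, S, V, T, P.
data = {
    "UF": [(81, 81), (105, 82), (120, 80), (110, 87), (98, 84)],
    "W":  [(147, 100), (142, 116), (151, 112), (192, 148), (146, 108)],
    "M":  [(82, 103), (77, 105), (78, 117), (131, 140), (90, 130)],
    "C":  [(120, 99), (121, 62), (124, 96), (141, 126), (125, 76)],
    "GR": [(99, 66), (89, 50), (69, 97), (89, 62), (104, 80)],
    "D":  [(87, 68), (77, 67), (79, 67), (102, 92), (96, 94)],
}
rows = list(data.values())
r, c = len(rows), len(rows[0])          # 6 locations, 5 varieties


def outer(v):
    """Return v v' for a 2-vector as the triple (s11, s12, s22)."""
    return (v[0] * v[0], v[0] * v[1], v[1] * v[1])


def add(s, t, scale=1.0):
    """Return s + scale * t for symmetric 2 x 2 matrices stored as triples."""
    return tuple(x + scale * y for x, y in zip(s, t))


def det(s):
    """Determinant of a symmetric 2 x 2 matrix stored as a triple."""
    return s[0] * s[2] - s[1] ** 2


zero = (0.0, 0.0, 0.0)
total, row_part, col_part = zero, zero, zero

for row in rows:
    for y in row:
        total = add(total, outer(y))                    # as in (13)
    row_sum = (sum(y[0] for y in row), sum(y[1] for y in row))
    row_part = add(row_part, outer(row_sum))            # as in (15)

for j in range(c):
    col_sum = (sum(row[j][0] for row in rows), sum(row[j][1] for row in rows))
    col_part = add(col_part, outer(col_sum))            # as in (14)

grand_sum = (sum(y[0] for row in rows for y in row),
             sum(y[1] for row in rows for y in row))
grand_part = outer(grand_sum)                           # as in (16)

# Error matrix A and column (variety) matrix B.
A = add(add(add(total, row_part, -1 / c), col_part, -1 / r),
        grand_part, 1 / (r * c))
B = add(tuple(x / r for x in col_part), grand_part, -1 / (r * c))

U = det(A) / det(add(A, B))

n, m = (r - 1) * (c - 1), c - 1                         # 20 and 4
F = (1 - math.sqrt(U)) / math.sqrt(U) * (n - 1) / m

print([round(x) for x in A], [round(x) for x in B])
print(f"U = {U:.4f}, F = {F:.2f} on {2 * m} and {2 * (n - 1)} degrees of freedom")
```

Working with unrounded sums of products gives a criterion that agrees with the rounded matrices of the text to four decimal places.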
Now let us see that each F-test in the univariate analysis of variance has
analogous tests in the multivariate analysis of variance. In the linear hypothesis
model for the univariate analysis of variance, one assumes that the
random variables $Y_1, \ldots, Y_N$ have expected values that are linear combinations
of unknown parameters

(21)   $\mathscr{E} Y_{\alpha} = \sum_{g} \beta_g z_{g\alpha},$

where the $\beta$'s are the parameters and the $z$'s are the known coefficients. The
variables $\{Y_{\alpha}\}$ are assumed to be normally and independently distributed with
common variance $\sigma^2$. In this model there are a set of linear combinations,
say $\sum_{\alpha=1}^{N} y_{i\alpha} Y_{\alpha}$, where the $y$'s are known, such that

(22)   $a = \sum_{i=1}^{n} \Bigl( \sum_{\alpha=1}^{N} y_{i\alpha} Y_{\alpha} \Bigr)^2 = \sum_{\alpha,\beta=1}^{N} d_{\alpha\beta} Y_{\alpha} Y_{\beta}$

is distributed as $\sigma^2 \chi^2$ with $n$ degrees of freedom. There is another set of
linear combinations, say $\sum_{\alpha=1}^{N} \phi_{g\alpha} Y_{\alpha}$, where the $\phi$'s are known, such that

(23)   $b = \sum_{g=1}^{m} \Bigl( \sum_{\alpha=1}^{N} \phi_{g\alpha} Y_{\alpha} \Bigr)^2 = \sum_{\alpha,\beta=1}^{N} c_{\alpha\beta} Y_{\alpha} Y_{\beta}$

is distributed as $\sigma^2 \chi^2$ with $m$ degrees of freedom when the null hypothesis is
true and as $\sigma^2$ times a noncentral $\chi^2$ when the null hypothesis is not true;
and in either case $b$ is distributed independently of $a$. Then

(24)   $\frac{b/m}{a/n} = \frac{b}{a} \cdot \frac{n}{m}$

has the F-distribution with $m$ and $n$ degrees of freedom, respectively, when
the null hypothesis is true. The null hypothesis is that certain $\beta$'s are zero.

In the multivariate analysis of variance, $Y_1, \ldots, Y_N$ are vector variables with
$p$ components. The expected value of $Y_{\alpha}$ is given by (21) where $\beta_g$ is a vector
of $p$ parameters. We assume that the $\{Y_{\alpha}\}$ are normally and independently
distributed with common covariance matrix $\Sigma$. The linear combinations
$\sum_{\alpha} y_{i\alpha} Y_{\alpha}$ can be formed for the vectors. Then

(25)   $A = \sum_{i=1}^{n} \Bigl( \sum_{\alpha=1}^{N} y_{i\alpha} Y_{\alpha} \Bigr) \Bigl( \sum_{\alpha=1}^{N} y_{i\alpha} Y_{\alpha} \Bigr)' = \sum_{\alpha,\beta=1}^{N} d_{\alpha\beta} Y_{\alpha} Y_{\beta}'$

has the distribution $W(\Sigma, n)$. When the null hypothesis is true,

(26)   $B = \sum_{g=1}^{m} \Bigl( \sum_{\alpha=1}^{N} \phi_{g\alpha} Y_{\alpha} \Bigr) \Bigl( \sum_{\alpha=1}^{N} \phi_{g\alpha} Y_{\alpha} \Bigr)' = \sum_{\alpha,\beta=1}^{N} c_{\alpha\beta} Y_{\alpha} Y_{\beta}'$

has the distribution $W(\Sigma, m)$, and $B$ is independent of $A$. Then

(27)   $\frac{|A|}{|A + B|} = \frac{\bigl| \sum d_{\alpha\beta} Y_{\alpha} Y_{\beta}' \bigr|}{\bigl| \sum d_{\alpha\beta} Y_{\alpha} Y_{\beta}' + \sum c_{\alpha\beta} Y_{\alpha} Y_{\beta}' \bigr|}$

has the $U_{p,m,n}$-distribution.

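The argument for the distribution of $a$ and $b$ involves showing that
$\mathscr{E} \sum_{\alpha} y_{i\alpha} Y_{\alpha} = 0$ and $\mathscr{E} \sum_{\alpha} \phi_{g\alpha} Y_{\alpha} = 0$ when certain $\beta$'s are equal to zero as
specified by the null hypothesis (as identities in the unspecified $\beta$'s). Clearly
this argument holds for the vector case as well. Secondly, one argues, in the
univariate case, that there is an orthogonal matrix $\Psi = (\psi_{\beta\gamma})$ such that when
the transformation $Y_{\beta} = \sum_{\gamma} \psi_{\beta\gamma} Z_{\gamma}$ is made

(28)   $a = \sum_{\alpha,\beta,\gamma,\delta} d_{\alpha\beta} \psi_{\alpha\gamma} \psi_{\beta\delta} Z_{\gamma} Z_{\delta} = \sum_{\alpha=1}^{n} Z_{\alpha}^2,$
       $b = \sum_{\alpha,\beta,\gamma,\delta} c_{\alpha\beta} \psi_{\alpha\gamma} \psi_{\beta\delta} Z_{\gamma} Z_{\delta} = \sum_{\alpha=n+1}^{n+m} Z_{\alpha}^2.$

Because the transformation is orthogonal, the $\{Z_{\alpha}\}$ are independently and
normally distributed with common variance $\sigma^2$. Since the $Z_{\alpha}$, $\alpha = 1, \ldots, n$,
must be linear combinations of $\sum_{\alpha} y_{i\alpha} Y_{\alpha}$ and since $Z_{\alpha}$, $\alpha = n + 1, \ldots, n + m$,
must be linear combinations of $\sum_{\alpha} \phi_{g\alpha} Y_{\alpha}$, they must have means zero (under
the null hypothesis). Thus $a/\sigma^2$ and $b/\sigma^2$ have the stated independent
$\chi^2$-distributions.

In the multivariate case the transformation $Y_{\beta} = \sum_{\gamma} \psi_{\beta\gamma} Z_{\gamma}$ is used, where
$Y_{\beta}$ and $Z_{\gamma}$ are vectors. Then

(29)   $A = \sum_{\alpha,\beta,\gamma,\delta} d_{\alpha\beta} \psi_{\alpha\gamma} \psi_{\beta\delta} Z_{\gamma} Z_{\delta}' = \sum_{\alpha=1}^{n} Z_{\alpha} Z_{\alpha}',$
       $B = \sum_{\alpha,\beta,\gamma,\delta} c_{\alpha\beta} \psi_{\alpha\gamma} \psi_{\beta\delta} Z_{\gamma} Z_{\delta}' = \sum_{\alpha=n+1}^{n+m} Z_{\alpha} Z_{\alpha}'$

because it follows from (28) that $\sum_{\alpha,\beta} d_{\alpha\beta} \psi_{\alpha\gamma} \psi_{\beta\delta} = 1$, $\gamma = \delta \le n$, and $= 0$
otherwise, and $\sum_{\alpha,\beta} c_{\alpha\beta} \psi_{\alpha\gamma} \psi_{\beta\delta} = 1$, $n + 1 \le \gamma = \delta \le n + m$, and $= 0$ otherwise.
Since $\Psi$ is orthogonal, the $\{Z_{\alpha}\}$ are independently normally distributed
with covariance matrix $\Sigma$. The same argument shows $\mathscr{E} Z_{\alpha} = 0$, $\alpha = 1, \ldots, n + m$,
under the null hypothesis. Thus $A$ and $B$ are independently distributed
according to $W(\Sigma, n)$ and $W(\Sigma, m)$, respectively.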


8.10. SOME OPTIMAL PROPERTIES OF TESTS 
8.10.1. Admissibility of Invariant Tests 


In this chapter we have considered several tests of a linear hypothesis which
are invariant with respect to transformations that leave the null hypothesis
invariant. We raise the question of which invariant tests are good tests. In
particular we ask for admissible procedures, that is, procedures that cannot
be improved on in the sense of smaller probabilities of Type I and/or Type
II error. The competing tests are not necessarily invariant. Clearly, if an
invariant test is admissible in the class of all tests, it is admissible in the class
of invariant tests.

Testing the general linear hypothesis as treated here is a generalization of
testing the hypothesis concerning one mean vector as treated in Chapter 5.
The invariant procedures in Chapter 8 are generalizations of the $T^2$-test.
One way of showing a procedure is admissible is to display a prior distribution
on the parameters such that the Bayes procedure is a given test
procedure. This approach requires some ingenuity in constructing the prior,
but the verification of the property given the prior is straightforward. Problems
8.26 and 8.27 show that the Bartlett-Nanda-Pillai trace criterion V and
Wilks's likelihood ratio criterion U yield admissible tests. The disadvantage
of this approach to admissibility is that one must invent a prior distribution
for each procedure; a general theorem does not cover many cases.

The other approach to admissibility is to apply Stein's theorem (Theorem
5.6.5), which yields general results. The invariant tests can be stated in terms
of the roots of the determinantal equation

(1)    $|H - \lambda(H + G)| = 0,$

where $H = W_1 W_1'$ and $G = N\hat{\Sigma}_{\Omega} = W_3 W_3'$. There is also a matrix
$\hat{\beta}_2$ (or $W_2$) associated with the nuisance parameters $\beta_2$. For convenience, we
define the canonical form in the following notation. Let $W_1 = X$ $(p \times m)$,
$W_2 = Y$ $(p \times r)$, $W_3 = Z$ $(p \times n)$, $\mathscr{E} X = \Xi$, $\mathscr{E} Y = \mathrm{H}$, and $\mathscr{E} Z = 0$; the columns
are independently normally distributed with covariance matrix $\Sigma$. The null
hypothesis is $\Xi = 0$, and the alternative hypothesis is $\Xi \neq 0$.

The usual tests are given in terms of the (nonzero) roots of

(2)    $|XX' - \lambda(ZZ' + XX')| = |XX' - \lambda(U - YY')| = 0,$

where $U = XX' + YY' + ZZ'$. Except for roots that are identically zero, the
roots of (2) coincide with the nonzero characteristic roots of $X'(U - YY')^{-1}X$.
Let $V = (X, Y, U)$ and

(3)    $M(V) = X'(U - YY')^{-1}X.$

The vector of ordered characteristic roots of $M(V)$ is denoted by

(4)    $(\lambda_1, \ldots, \lambda_m)' = \lambda(M(V)),$

where $\lambda_1 \ge \cdots \ge \lambda_m \ge 0$. Since the inclusion of zero roots (when $m > p$)
causes no trouble in the sequel, we assume that the tests depend on $\lambda(M(V))$.

The admissibility of these tests can be stated in terms of the geometric
characteristics of the acceptance regions. Let

(5)    $R^m_{\ge} = \{\lambda \in R^m \mid \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_m \ge 0\},$
       $R^m_{+} = \{\lambda \in R^m \mid \lambda_1 \ge 0, \ldots, \lambda_m \ge 0\}.$

It seems reasonable that if a set of sample roots leads to acceptance of the
null hypothesis, then a set of smaller roots would as well (Figure 8.2).

Definition 8.10.1. A region $A \subset R^m_{\ge}$ is monotone if $\lambda \in A$, $\nu \in R^m_{\ge}$, and
$\nu_i \le \lambda_i$, $i = 1, \ldots, m$, imply $\nu \in A$.

Definition 8.10.2. For $A \subset R^m_{\ge}$ the extended region $A^*$ is

(6)    $A^* = \bigcup_{\pi} \{(x_{\pi(1)}, \ldots, x_{\pi(m)})' \mid x \in A\},$

where $\pi$ ranges over all permutations of $(1, \ldots, m)$.

Figure 8.2. A monotone acceptance region.
The main result, first proved by Schwartz (1967), is the following theorem:

Theorem 8.10.1. If the region $A \subset R^m_{\ge}$ is monotone and if the extended
region $A^*$ is closed and convex, then $A$ is the acceptance region of an admissible
test.

Another characterization of admissible tests is given in terms of majorization.

Definition 8.10.3. A vector $\lambda = (\lambda_1, \ldots, \lambda_m)'$ weakly majorizes a vector
$\nu = (\nu_1, \ldots, \nu_m)'$ if

(7)    $\lambda_{[1]} \ge \nu_{[1]}, \quad \lambda_{[1]} + \lambda_{[2]} \ge \nu_{[1]} + \nu_{[2]}, \quad \ldots, \quad \lambda_{[1]} + \cdots + \lambda_{[m]} \ge \nu_{[1]} + \cdots + \nu_{[m]},$

where $\lambda_{[i]}$ and $\nu_{[i]}$, $i = 1, \ldots, m$, are the coordinates rearranged in nonascending
order.

We use the notation $\lambda \succ_w \nu$ or $\nu \prec_w \lambda$ if $\lambda$ weakly majorizes $\nu$. If
$\lambda, \nu \in R^m_{\ge}$, then $\lambda \succ_w \nu$ is simply

(8)    $\lambda_1 \ge \nu_1, \quad \lambda_1 + \lambda_2 \ge \nu_1 + \nu_2, \quad \ldots, \quad \lambda_1 + \cdots + \lambda_m \ge \nu_1 + \cdots + \nu_m.$

If the last inequality in (7) is replaced by an equality, we say simply that $\lambda$
majorizes $\nu$ and denote this by $\lambda \succ \nu$ or $\nu \prec \lambda$. The theory of majorization
and the related inequalities are developed in detail in Marshall and Olkin
(1979).
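
The partial-sum conditions in (7) translate directly into a short check. The Python sketch below is only an illustration of the definition; the function names `weakly_majorizes` and `majorizes` and the tolerance argument are hypothetical choices, not notation from the text.

```python
from itertools import accumulate


def weakly_majorizes(lam, nu, tol=1e-12):
    """Return True if lam weakly majorizes nu in the sense of (7)."""
    if len(lam) != len(nu):
        raise ValueError("vectors must have the same length")
    lam_sorted = sorted(lam, reverse=True)   # coordinates in nonascending order
    nu_sorted = sorted(nu, reverse=True)
    # Compare the partial sums of the ordered coordinates.
    return all(s >= t - tol
               for s, t in zip(accumulate(lam_sorted), accumulate(nu_sorted)))


def majorizes(lam, nu, tol=1e-12):
    """Weak majorization together with equality of the total sums."""
    return weakly_majorizes(lam, nu, tol) and abs(sum(lam) - sum(nu)) <= tol


print(weakly_majorizes([3, 1, 0], [2, 1, 1]))   # partial sums 3, 4, 4 against 2, 3, 4
print(weakly_majorizes([2, 2, 0], [3, 0, 0]))   # fails already at the first partial sum
print(majorizes([3, 1, 0], [2, 1, 1]))          # both vectors sum to 4
```

Sorting inside the function means the inputs need not be ordered beforehand, which matches the use of the rearranged coordinates in the definition.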

Definition 8.10.4. A region $A \subset R^m_{\ge}$ is monotone in majorization if $\lambda \in A$,
$\nu \in R^m_{\ge}$, $\nu \prec_w \lambda$ imply $\nu \in A$. (See Figure 8.3.)

Theorem 8.10.2. If a region $A \subset R^m_{\ge}$ is closed, convex, and monotone in
majorization, then $A$ is the acceptance region of an admissible test.

Figure 8.3. A region monotone in majorization.

Theorems 8.10.1 and 8.10.2 are equivalent; it will be convenient to prove
Theorem 8.10.2 first. Then an argument about the extreme points of a
certain convex set (Lemma 8.10.11) establishes the equivalence of the two
theorems.

Theorem 5.6.5 (Stein's theorem) will be used because we can write the
distribution of $(X, Y, Z)$ in exponential form. Let $U = XX' + YY' + ZZ' = (u_{ij})$
and $\Sigma^{-1} = (\sigma^{ij})$. For a general matrix $C = (c_1, \ldots, c_k)$, let $\operatorname{vec}(C) =
(c_1', \ldots, c_k')'$. The density of $(X, Y, Z)$ can be written as

(9)    $f(X, Y, Z) = K(\Xi, \mathrm{H}, \Sigma) \exp\{\operatorname{tr} \Xi' \Sigma^{-1} X + \operatorname{tr} \mathrm{H}' \Sigma^{-1} Y - \tfrac{1}{2} \operatorname{tr} \Sigma^{-1} U\}$
       $\quad = K(\Xi, \mathrm{H}, \Sigma) \exp\{\omega_{(1)}' y_{(1)} + \omega_{(2)}' y_{(2)} + \omega_{(3)}' y_{(3)}\},$

where $K(\Xi, \mathrm{H}, \Sigma)$ is a constant,

(10)   $\omega_{(1)} = \operatorname{vec}(\Sigma^{-1} \Xi), \qquad \omega_{(2)} = \operatorname{vec}(\Sigma^{-1} \mathrm{H}),$
       $\omega_{(3)} = -\tfrac{1}{2} (\sigma^{11}, 2\sigma^{12}, \ldots, 2\sigma^{1p}, \sigma^{22}, \ldots, \sigma^{pp})',$
       $y_{(1)} = \operatorname{vec}(X), \qquad y_{(2)} = \operatorname{vec}(Y),$
       $y_{(3)} = (u_{11}, u_{12}, \ldots, u_{1p}, u_{22}, \ldots, u_{pp})'.$

If we denote the mapping $(X, Y, Z) \to y = (y_{(1)}', y_{(2)}', y_{(3)}')'$ by $g$, $y = g(X, Y, Z)$,
then the measure of a set $A$ in the space of $y$ is $m(A) = \mu(g^{-1}(A))$, where
$\mu$ is the ordinary Lebesgue measure on $R^{p(m+r+n)}$. We note that $(X, Y, U)$ is
a sufficient statistic and so is $y = (y_{(1)}', y_{(2)}', y_{(3)}')'$. Because a test that is
admissible with respect to the class of tests based on a sufficient statistic
is admissible in the whole class of tests, we consider only tests based on a
sufficient statistic. Then the acceptance regions of these tests are subsets in
the space of $y$. The density of $y$ given by the right-hand side of (9) is of the
form of the exponential family, and therefore we can apply Stein's theorem.
Furthermore, since the transformation $(X, Y, U) \to y$ is linear, we prove the
convexity of an acceptance region of $(X, Y, U)$. The acceptance region of an
invariant test is given in terms of $\lambda(M(V)) = (\lambda_1, \ldots, \lambda_m)'$. Therefore, in
order to prove the admissibility of these tests we have to check that the
inverse image of $A$, namely, $\tilde{A} = \{V \mid \lambda(M(V)) \in A\}$, satisfies the conditions
of Stein's theorem, namely, is convex.

Suppose $V_i = (X_i, Y_i, U_i) \in \tilde{A}$, $i = 1, 2$, that is, $\lambda[M(V_i)] \in A$. By the convexity
of $A$, $p\lambda[M(V_1)] + q\lambda[M(V_2)] \in A$ for $0 \le p = 1 - q \le 1$. To show $pV_1 +
qV_2 \in \tilde{A}$, that is, $\lambda[M(pV_1 + qV_2)] \in A$, we use the property of monotonicity
of majorization of $A$ and the following theorem.

Figure 8.4. Theorem 8.10.3.

Theorem 8.10.3.

(11)   $\lambda[M(pV_1 + qV_2)] \prec_w p\lambda[M(V_1)] + q\lambda[M(V_2)].$

The proof of Theorem 8.10.3 (Figure 8.4) follows from the pair of
majorizations

(12)   $\lambda[M(pV_1 + qV_2)] \prec_w \lambda[pM(V_1) + qM(V_2)]$
       $\quad \prec_w p\lambda[M(V_1)] + q\lambda[M(V_2)].$

The second majorization in (12) is a special case of the following lemma.

Lemma 8.10.1. For $A$ and $B$ symmetric,

(13)   $\lambda(A + B) \prec_w \lambda(A) + \lambda(B).$

Proof. By Corollary A.4.2 of the Appendix,

(14)   $\sum_{i=1}^{k} \lambda_i(A + B) = \max_{R'R = I_k} \operatorname{tr} R'(A + B)R$
       $\quad \le \max_{R'R = I_k} \operatorname{tr} R'AR + \max_{R'R = I_k} \operatorname{tr} R'BR$
       $\quad = \sum_{i=1}^{k} \lambda_i(A) + \sum_{i=1}^{k} \lambda_i(B)$
       $\quad = \sum_{i=1}^{k} \{\lambda_i(A) + \lambda_i(B)\}, \qquad k = 1, \ldots, p.$ ■
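
A small numerical case may help fix the content of Lemma 8.10.1. The sketch below checks the partial-sum inequalities of (14) for two hypothetical 2 x 2 symmetric matrices, using the closed form for the characteristic roots of a 2 x 2 symmetric matrix; the matrices and the helper `eig2` are illustrative assumptions, not taken from the text.

```python
import math


def eig2(a, b, d):
    """Ordered characteristic roots of the symmetric matrix [[a, b], [b, d]]."""
    mid = (a + d) / 2
    rad = math.hypot((a - d) / 2, b)
    return (mid + rad, mid - rad)


# Two hypothetical symmetric matrices, stored as (a11, a12, a22).
A = (2.0, 1.0, 0.0)
B = (0.0, -1.0, 3.0)
S = tuple(x + y for x, y in zip(A, B))      # A + B

lam_sum = eig2(*S)
lam_A, lam_B = eig2(*A), eig2(*B)
bound = tuple(x + y for x, y in zip(lam_A, lam_B))   # lambda(A) + lambda(B)

# Partial sums on the two sides of (14) for k = 1, 2.
lhs = (lam_sum[0], lam_sum[0] + lam_sum[1])
rhs = (bound[0], bound[0] + bound[1])

print("partial sums for lambda(A + B):        ", lhs)
print("partial sums for lambda(A) + lambda(B):", rhs)
```

For k = p the two sides coincide, since both equal tr(A + B); the inequality is informative only for the leading partial sums.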
Let $A > B$ mean $A - B$ is positive definite and $A \ge B$ mean $A - B$ is
positive semidefinite.

The first majorization in (12) follows from several lemmas.

Lemma 8.10.2.

(15)   $pU_1 + qU_2 - (pY_1 + qY_2)(pY_1 + qY_2)' \ge p(U_1 - Y_1Y_1') + q(U_2 - Y_2Y_2').$

Proof. The left-hand side minus the right-hand side is

(16)   $pY_1Y_1' + qY_2Y_2' - p^2 Y_1Y_1' - q^2 Y_2Y_2' - pq(Y_1Y_2' + Y_2Y_1')$
       $\quad = p(1 - p)Y_1Y_1' + q(1 - q)Y_2Y_2' - pq(Y_1Y_2' + Y_2Y_1')$
       $\quad = pq(Y_1 - Y_2)(Y_1 - Y_2)' \ge 0.$ ■

Lemma 8.10.3. If $A \ge B > 0$, then $A^{-1} \le B^{-1}$.

Proof. See Problem 8.31. ■

Lemma 8.10.4. If $A > 0$, then $f(x, A) = x'A^{-1}x$ is convex in $(x, A)$.

Proof. See Problem 5.17. ■

Lemma 8.10.5. If $A_1 > 0$, $A_2 > 0$, then

(17)   $(pB_1 + qB_2)'(pA_1 + qA_2)^{-1}(pB_1 + qB_2) \le pB_1'A_1^{-1}B_1 + qB_2'A_2^{-1}B_2.$

Proof. From Lemma 8.10.4 we have for all $y$

(18)   $py'B_1'A_1^{-1}B_1y + qy'B_2'A_2^{-1}B_2y - y'(pB_1 + qB_2)'(pA_1 + qA_2)^{-1}(pB_1 + qB_2)y$
       $\quad = p(B_1y)'A_1^{-1}(B_1y) + q(B_2y)'A_2^{-1}(B_2y) - (pB_1y + qB_2y)'(pA_1 + qA_2)^{-1}(pB_1y + qB_2y)$
       $\quad \ge 0.$

Thus the matrix of the quadratic form in $y$ is positive semidefinite. ■

The relation as in (17) is sometimes called matrix convexity. [See Marshall
and Olkin (1979).]

Lemma 8.10.6.

(19)   $M(pV_1 + qV_2) \le pM(V_1) + qM(V_2),$

where $V_1 = (X_1, Y_1, U_1)$, $V_2 = (X_2, Y_2, U_2)$, $U_1 - Y_1Y_1' > 0$, $U_2 - Y_2Y_2' > 0$,
$0 \le p = 1 - q \le 1$.

Proof. Lemmas 8.10.2 and 8.10.3 show that

(20)   $[pU_1 + qU_2 - (pY_1 + qY_2)(pY_1 + qY_2)']^{-1} \le [p(U_1 - Y_1Y_1') + q(U_2 - Y_2Y_2')]^{-1}.$

This implies

(21)   $M(pV_1 + qV_2) \le (pX_1 + qX_2)'[p(U_1 - Y_1Y_1') + q(U_2 - Y_2Y_2')]^{-1}(pX_1 + qX_2).$

Then Lemma 8.10.5 implies that the right-hand side of (21) is less than or
equal to

(22)   $pX_1'(U_1 - Y_1Y_1')^{-1}X_1 + qX_2'(U_2 - Y_2Y_2')^{-1}X_2 = pM(V_1) + qM(V_2).$ ■

Lemma 8.10.7. If $A \le B$, then $\lambda(A) \prec_w \lambda(B)$.

Proof. From Corollary A.4.2 of the Appendix,

(23)   $\sum_{i=1}^{k} \lambda_i(A) = \max_{R'R = I_k} \operatorname{tr} R'AR \le \max_{R'R = I_k} \operatorname{tr} R'BR = \sum_{i=1}^{k} \lambda_i(B), \qquad k = 1, \ldots, p.$ ■

From Lemma 8.10.7 we obtain the first majorization in (12) and hence
Theorem 8.10.3, which in turn implies the convexity of $\tilde{A}$. Thus the acceptance
region satisfies condition (i) of Stein's theorem.

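Lemma 8.10.8. For the acceptance region $A$ of Theorem 8.10.1 or Theorem
8.10.2, condition (ii) of Stein's theorem is satisfied.

Proof. Let $\omega$ correspond to $(\Phi, \Psi, \Theta)$; then

(24)   $\omega' y = \omega_{(1)}' y_{(1)} + \omega_{(2)}' y_{(2)} + \omega_{(3)}' y_{(3)}$
       $\quad = \operatorname{tr} \Phi' X + \operatorname{tr} \Psi' Y - \tfrac{1}{2} \operatorname{tr} \Theta U,$

where $\Theta$ is symmetric. Suppose that $\{y \mid \omega' y > c\}$ is disjoint from $\tilde{A} =
\{V \mid \lambda(M(V)) \in A\}$. We want to show that in this case $\Theta$ is positive semidefinite.
If this were not true, then

(25)   $\Theta = D \begin{pmatrix} I & 0 & 0 \\ 0 & -I & 0 \\ 0 & 0 & 0 \end{pmatrix} D',$

where $D$ is nonsingular and $-I$ is not vacuous. Let $X = (1/\gamma)X_0$, $Y = (1/\gamma)Y_0$,

(26)   $U = (D')^{-1} \begin{pmatrix} I & 0 & 0 \\ 0 & \gamma I & 0 \\ 0 & 0 & I \end{pmatrix} D^{-1},$

and $V = (X, Y, U)$, where $X_0$, $Y_0$ are fixed matrices and $\gamma$ is a positive
number. Then

(27)   $\omega' y = \frac{1}{\gamma} \operatorname{tr} \Phi' X_0 + \frac{1}{\gamma} \operatorname{tr} \Psi' Y_0 + \frac{1}{2} \operatorname{tr} \begin{pmatrix} -I & 0 & 0 \\ 0 & \gamma I & 0 \\ 0 & 0 & 0 \end{pmatrix} > c$

for sufficiently large $\gamma$. On the other hand,

(28)   $\lambda(M(V)) = \lambda\{X'(U - YY')^{-1}X\}$
       $\quad = \frac{1}{\gamma^2} \lambda\Bigl\{ X_0' \Bigl[ (D')^{-1} \begin{pmatrix} I & 0 & 0 \\ 0 & \gamma I & 0 \\ 0 & 0 & I \end{pmatrix} D^{-1} - \frac{1}{\gamma^2} Y_0 Y_0' \Bigr]^{-1} X_0 \Bigr\} \to 0$

as $\gamma \to \infty$. Therefore, $V \in \tilde{A}$ for sufficiently large $\gamma$. This is a contradiction.
Hence $\Theta$ is positive semidefinite.

Now let $\omega_1$ correspond to $(\Phi_1, 0, I)$, where $\Phi_1 \neq 0$. Then $I + \lambda\Theta$ is
positive definite and $\Phi_1 + \lambda\Phi \neq 0$ for sufficiently large $\lambda$. Hence $\omega_1 + \lambda\omega \in
\Omega - \Omega_0$ for sufficiently large $\lambda$. ■

The preceding proof was suggested by Charles Stein.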

By Theorem 5.6.5, Theorem 8.10.3 and Lemma 8.10.8 now imply Theorem
8.10.2.

To obtain Theorem 8.10.1 from Theorem 8.10.2, we use the following
lemmas.

Lemma 8.10.9. $A \subset R^m_{\ge}$ is convex and monotone in majorization if and
only if $A$ is monotone and $A^*$ is convex.

Proof. Necessity. If $A$ is monotone in majorization, then it is obviously
monotone. $A^*$ is convex (see Problem 8.35).

Sufficiency. For $\lambda \in A$ let

(29)   $C(\lambda) = \{x \mid x \in R^m_{+},\ x \prec_w \lambda\},$
       $D(\lambda) = \{x \mid x \in R^m_{\ge},\ x \prec_w \lambda\}.$

It will be proved in Lemma 8.10.10, Lemma 8.10.11, and its corollary that
monotonicity of $A$ and convexity of $A^*$ implies $C(\lambda) \subset A^*$. Then $D(\lambda) =
C(\lambda) \cap R^m_{\ge} \subset A^* \cap R^m_{\ge} = A$. Now suppose $\nu \in R^m_{\ge}$ and $\nu \prec_w \lambda$. Then $\nu \in
D(\lambda) \subset A$. This shows that $A$ is monotone in majorization. Furthermore, if
$A^*$ is convex, then $A = R^m_{\ge} \cap A^*$ is convex. (See Figure 8.5.) ■


Lemma 8.10.10. Let $C$ be compact and convex, and let $D$ be convex. If the
extreme points of $C$ are contained in $D$, then $C \subset D$.

Proof. Obvious. ■

Lemma 8.10.11. Every extreme point of $C(\lambda)$ is of the form

(30)   $x_i = \delta_i \lambda_{\pi(i)}, \qquad i = 1, \ldots, m,$

where $\pi$ is a permutation of $(1, \ldots, m)$ and $\delta_1 = \cdots = \delta_k = 1$, $\delta_{k+1} = \cdots = \delta_m
= 0$ for some $k$.

Proof. $C(\lambda)$ is convex. (See Problem 8.34.) Now note that $C(\lambda)$ is
permutation-symmetric, that is, if $(x_1, \ldots, x_m)' \in C(\lambda)$, then $(x_{\pi(1)}, \ldots, x_{\pi(m)})' \in
C(\lambda)$ for any permutation $\pi$. Therefore, for any permutation $\pi$, $\pi(C(\lambda)) =
\{(x_{\pi(1)}, \ldots, x_{\pi(m)})' \mid x \in C(\lambda)\}$ coincides with $C(\lambda)$. This implies that if
$(x_1, \ldots, x_m)'$ is an extreme point of $C(\lambda)$, then $(x_{\pi(1)}, \ldots, x_{\pi(m)})'$ is also an
extreme point. In particular, $(x_{[1]}, \ldots, x_{[m]})' \in R^m_{\ge}$ is an extreme point.
Conversely, if $(x_1, \ldots, x_m)' \in R^m_{\ge}$ is an extreme point of $C(\lambda)$, then
$(x_{\pi(1)}, \ldots, x_{\pi(m)})'$ is an extreme point.

We see that once we enumerate the extreme points of $C(\lambda)$ in $R^m_{\ge}$, the
rest of the extreme points can be obtained by permutation.

Suppose $x \in R^m_{\ge}$. An extreme point, being the intersection of $m$ hyperplanes,
has to satisfy $m$ or more of the following $2m$ equations:

(31)   $E_1: x_1 = 0, \qquad F_1: x_1 = \lambda_1,$
       $E_2: x_2 = 0, \qquad F_2: x_1 + x_2 = \lambda_1 + \lambda_2,$
       $\quad\vdots$
       $E_m: x_m = 0, \qquad F_m: x_1 + \cdots + x_m = \lambda_1 + \cdots + \lambda_m.$

Suppose that $k$ is the first index such that $E_k$ holds. Then $x \in R^m_{\ge}$ implies
$0 = x_k \ge x_{k+1} \ge \cdots \ge x_m \ge 0$. Therefore, $E_k, \ldots, E_m$ hold. The remaining
$k - 1 = m - (m - k + 1)$ or more equations are among the $F$'s. We order
them as $F_{i_1}, \ldots, F_{i_l}$, where $i_1 < \cdots < i_l$, $l \ge k - 1$. Now $i_1 < \cdots < i_l$ implies
$i_l \ge l$ with equality if and only if $i_1 = 1, \ldots, i_l = l$. In this case $F_1, \ldots, F_{k-1}$
hold ($l \ge k - 1$). Now suppose $i_l > l$. Since $x_k = \cdots = x_m = 0$,

(32)   $F_{i_l}: x_1 + \cdots + x_{k-1} = \lambda_1 + \cdots + \lambda_{k-1} + \cdots + \lambda_{i_l}.$

But $x_1 + \cdots + x_{k-1} \le \lambda_1 + \cdots + \lambda_{k-1}$, and we have $\lambda_k + \cdots + \lambda_{i_l} = 0$. Therefore,
$0 = \lambda_k + \cdots + \lambda_{i_l} \ge \lambda_k \ge \cdots \ge \lambda_m \ge 0$. In this case $F_{k-1}, \ldots, F_m$ reduce
to the same equation $x_1 + \cdots + x_{k-1} = \lambda_1 + \cdots + \lambda_{k-1}$. It follows that $x$
satisfies $k - 2$ more equations, which have to be $F_1, \ldots, F_{k-2}$. We have
shown that in either case $E_k, \ldots, E_m, F_1, \ldots, F_{k-1}$ hold and this gives the
point $\beta = (\lambda_1, \ldots, \lambda_{k-1}, 0, \ldots, 0)'$, which is in $R^m_{\ge} \cap C(\lambda)$. Therefore, $\beta$ is an
extreme point. ■

Corollary 8.10.1. $C(\lambda) \subset A^*$.

Proof. If $A$ is monotone, then $A^*$ is monotone in the sense that if
$\lambda = (\lambda_1, \ldots, \lambda_m)' \in A^*$, $\nu = (\nu_1, \ldots, \nu_m)'$, $\nu_i \le \lambda_i$, $i = 1, \ldots, m$, then $\nu \in A^*$.
(See Problem 8.35.) Now the extreme points of $C(\lambda)$ given by (30) are in $A^*$
because of permutation symmetry and monotonicity of $A^*$. Hence, by Lemma
8.10.10, $C(\lambda) \subset A^*$. ■

Proof of Theorem 8.10.1. Immediate from Theorem 8.10.2 and Lemma
8.10.9. ■

Application of the theory of Schur-convex functions yields several corollaries
to Theorem 8.10.2.

Corollary 8.10.2. Let $g$ be continuous, nondecreasing, and convex in $[0, 1)$.
Let

(33)   $f(\lambda) = f(\lambda_1, \ldots, \lambda_m) = \sum_{i=1}^{m} g(\lambda_i).$

Then a test with the acceptance region $A = \{\lambda \mid f(\lambda) \le c\}$ is admissible.

Proof. Being a sum of convex functions $f$ is convex, and hence $A$ is
convex. $A$ is closed because $f$ is continuous. We want to show that if
$f(x) \le c$ and $y \prec_w x$ $(x, y \in R^m_{\ge})$, then $f(y) \le c$. Let $\tilde{x}_k = \sum_{i=1}^{k} x_i$, $\tilde{y}_k = \sum_{i=1}^{k} y_i$.
Then $y \prec_w x$ if and only if $\tilde{x}_k \ge \tilde{y}_k$, $k = 1, \ldots, m$. Let $f(x) = h(\tilde{x}_1, \ldots, \tilde{x}_m) =
g(\tilde{x}_1) + \sum_{i=2}^{m} g(\tilde{x}_i - \tilde{x}_{i-1})$. It suffices to show that $h(\tilde{x}_1, \ldots, \tilde{x}_m)$ is increasing
in each $\tilde{x}_i$. For $i \le m - 1$ the convexity of $g$ implies that

(34)   $h(\tilde{x}_1, \ldots, \tilde{x}_i + \varepsilon, \ldots, \tilde{x}_m) - h(\tilde{x}_1, \ldots, \tilde{x}_i, \ldots, \tilde{x}_m)$
       $\quad = g(x_i + \varepsilon) - g(x_i) - \{g(x_{i+1}) - g(x_{i+1} - \varepsilon)\} \ge 0.$

For $i = m$ the monotonicity of $g$ implies

(35)   $h(\tilde{x}_1, \ldots, \tilde{x}_m + \varepsilon) - h(\tilde{x}_1, \ldots, \tilde{x}_m) = g(x_m + \varepsilon) - g(x_m) \ge 0.$ ■

Setting $g(\lambda) = -\log(1 - \lambda)$, $g(\lambda) = \lambda/(1 - \lambda)$, $g(\lambda) = \lambda$, respectively,
shows that Wilks' likelihood ratio test, the Lawley-Hotelling trace test, and
the Bartlett-Nanda-Pillai test are admissible. Admissibility of Roy's maximum
root test $A: \lambda_1 \le c$ follows directly from Theorem 8.10.1 or Theorem
8.10.2. On the contrary, the minimum root test, $\lambda_t \le c$, where $t = \min(m, p)$,
does not satisfy the convexity condition. The following theorem shows that
this test is actually inadmissible.
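
The three choices of g named above, together with the maximum root, can be written as functions of the roots. The Python sketch below is illustrative only: the function names and the sample roots are hypothetical, and each function simply evaluates the sum (33) for the corresponding g.

```python
import math


def wilks(lams):
    """Sum of g(l) = -log(1 - l); this equals -log of the product of (1 - l)."""
    return sum(-math.log(1.0 - l) for l in lams)


def lawley_hotelling(lams):
    """Sum of g(l) = l / (1 - l)."""
    return sum(l / (1.0 - l) for l in lams)


def bartlett_nanda_pillai(lams):
    """Sum of g(l) = l."""
    return sum(lams)


def roy(lams):
    """Largest root; not of the additive form (33)."""
    return max(lams)


roots = [0.5, 0.2]   # hypothetical roots in [0, 1)
for stat in (wilks, lawley_hotelling, bartlett_nanda_pillai, roy):
    print(f"{stat.__name__}: {stat(roots):.4f}")
```

In each case the acceptance region consists of the root vectors for which the statistic does not exceed a constant c.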

Theorem 8.10.4. A necessary condition for an invariant test to be admissible
is that the extended region in the space of $\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_t}$ is convex and monotone.

We shall only sketch the proof of this theorem [following Schwartz (1967)].
Let $\sqrt{\lambda_i} = d_i$, $i = 1, \ldots, t$, and let the density of $d_1, \ldots, d_t$ be $f(d \mid \nu)$, where
$\nu = (\nu_1, \ldots, \nu_t)'$ is defined in Section 8.6.5 and $f(d \mid \nu)$ is given in Chapter 13.
The ratio $f(d \mid \nu)/f(d \mid 0)$ can be extended symmetrically to the unit cube
$(0 \le d_i \le 1,\ i = 1, \ldots, t)$. The extended ratio is then a convex function and is
strictly increasing in each $d_i$. A proper Bayes procedure has an acceptance
region

(36)   $\int \frac{f(d \mid \nu)}{f(d \mid 0)}\, d\Pi(\nu) \le c,$

where $\Pi(\nu)$ is a finite measure on the space of $\nu$'s. Then the symmetric
extension of the set of $d$ satisfying (36) is convex and monotone [as shown by
Birnbaum (1955)]. The closure (in the weak* topology) of the set of Bayes
procedures forms an essentially complete class [Wald (1950)]. In this case the
limit of the convex monotone acceptance regions is convex and monotone.
The exposition of admissibility here was developed by Anderson and
Takemura (1982).

8.10.2. Unbiasedness of Tests and Monotonicity of Power Functions 

A test T is called unbiased if the power achieves its minimum at the null
hypothesis. When there is a natural parametrization and a notion of distance
in the parameter space, the power function is monotone if the power
increases as the distance between the alternative hypothesis and the null
hypothesis increases. Note that monotonicity implies unbiasedness. In this
section we shall show that the power functions of many of the invariant tests
of the general linear hypothesis are monotone in the invariants of the
parameters, namely, the roots; these can be considered as measures of
distance.

To introduce the approach, we consider the acceptance interval $(-a, a)$
for testing the null hypothesis $\mu = 0$ against the alternative $\mu \neq 0$ on the
basis of an observation from $N(\mu, \sigma^2)$. In Figure 8.6 the probabilities of
acceptance are represented by the shaded regions for three values of $\mu$. It is
clear that the probability of acceptance decreases monotonically (or equivalently
the power increases monotonically) as $\mu$ moves away from zero. In
fact, this property depends only on the density function being unimodal and
symmetric.
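
The one-dimensional case can be tabulated directly. The sketch below computes the probability that an observation from N(mu, sigma^2) falls in (-a, a) for several values of mu; the half-width a = 1.96, the grid of means, and the function names are hypothetical choices made for illustration.

```python
import math


def normal_cdf(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def acceptance_probability(mu, a, sigma=1.0):
    """P(-a < X < a) when X is distributed as N(mu, sigma^2)."""
    return normal_cdf((a - mu) / sigma) - normal_cdf((-a - mu) / sigma)


a = 1.96                       # hypothetical half-width of the acceptance interval
mus = [0.0, 0.5, 1.0, 2.0, 3.0]
probs = [acceptance_probability(mu, a) for mu in mus]

for mu, prob in zip(mus, probs):
    print(f"mu = {mu:3.1f}: acceptance = {prob:.4f}, power = {1 - prob:.4f}")
```

The acceptance probabilities decrease as mu moves away from zero, and by symmetry the same values are obtained for -mu.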



Figure 8.6. Three probabilities of acceptance.

In higher dimensions we generalize the interval by a symmetric convex set,
and we ask that the density function be symmetric and unimodal in the sense
that every contour of constant density surrounds a convex set. In Figure 8.7
we illustrate that in this case the probability of acceptance decreases
monotonically. The following theorem is due to Anderson (1955b).

Theorem 8.10.5. Let $E$ be a convex set in n-space, symmetric about the
origin. Let $f(x) \ge 0$ be a function such that (i) $f(x) = f(-x)$, (ii) $\{x \mid f(x) \ge u\} =
K_u$ is convex for every $u$ $(0 < u < \infty)$, and (iii) $\int_E f(x)\, dx < \infty$. Then

(37)   $\int_E f(x + ky)\, dx \ge \int_E f(x + y)\, dx$

for $0 \le k \le 1$.

The proof of Theorem 8.10.5 is based on the following lemma.

Lemma 8.10.12. Let $E$, $F$ be convex and symmetric about the origin. Then

(38)   $V\{(E + ky) \cap F\} \ge V\{(E + y) \cap F\},$

where $0 \le k \le 1$ and $V$ denotes the n-dimensional volume.

Proof. Consider the set $\alpha(E + y) + (1 - \alpha)(E - y) = \alpha E + (1 - \alpha)E +
(2\alpha - 1)y$, which consists of points $\alpha(x + y) + (1 - \alpha)(z - y)$ with $x, z \in E$.
Let $\alpha_0 = (k + 1)/2$, so that $2\alpha_0 - 1 = k$. Then by convexity of $E$ we have

(39)   $\alpha_0(E + y) + (1 - \alpha_0)(E - y) \subset E + ky.$

Hence by convexity of $F$

       $\alpha_0[(E + y) \cap F] + (1 - \alpha_0)[(E - y) \cap F] \subset (E + ky) \cap F$

and

(40)   $V\{\alpha_0[(E + y) \cap F] + (1 - \alpha_0)[(E - y) \cap F]\} \le V\{(E + ky) \cap F\}.$

Now by the Brunn-Minkowski inequality [e.g., Bonnesen and Fenchel (1948),
Section 48], we have

(41)   $V^{1/n}\{\alpha_0[(E + y) \cap F] + (1 - \alpha_0)[(E - y) \cap F]\}$
       $\quad \ge \alpha_0 V^{1/n}\{(E + y) \cap F\} + (1 - \alpha_0) V^{1/n}\{(E - y) \cap F\}$
       $\quad = \alpha_0 V^{1/n}\{(E + y) \cap F\} + (1 - \alpha_0) V^{1/n}\{(-E + y) \cap (-F)\}$
       $\quad = V^{1/n}\{(E + y) \cap F\}.$

The last equality follows from the symmetry of $E$ and $F$. ■


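Proof of Theorem 8.10.5. Let

(42)   $H(u) = V\{(E + ky) \cap K_u\},$

(43)   $H^*(u) = V\{(E + y) \cap K_u\}.$

Then

(44)   $\int_E f(x + y)\, dx = \int_{E+y} f(x)\, dx$
       $\quad = \int_{E+y} \int_0^{f(x)} du\, dx$
       $\quad = \int_0^{\infty} \int_{(E+y) \cap K_u} dx\, du$
       $\quad = \int_0^{\infty} H^*(u)\, du.$

Similarly,

(45)   $\int_E f(x + ky)\, dx = \int_0^{\infty} H(u)\, du.$

By Lemma 8.10.12, $H(u) \ge H^*(u)$. Hence Theorem 8.10.5 follows from (44)
and (45). ■

We start with the canonical form given in Section 8.10.1. We further
simplify the problem as follows. Let $t = \min(m, p)$, and let $\nu_1, \ldots, \nu_t$ $(\nu_1 \ge \nu_2
\ge \cdots \ge \nu_t)$ be the nonzero characteristic roots of $\Xi' \Sigma^{-1} \Xi$, where $\Xi = \mathscr{E} X$.

Lemma 8.10.13. There exist matrices $B$ $(p \times p)$ and $F$ $(m \times m)$ such that

(46)   $B \Sigma B' = I_p, \qquad F F' = I_m,$
       $B \Xi F' = (D_{\nu}^{1/2}, 0), \qquad p \le m,$
       $\qquad\;\; = \begin{pmatrix} D_{\nu}^{1/2} \\ 0 \end{pmatrix}, \qquad p > m,$

where $D_{\nu} = \operatorname{diag}(\nu_1, \ldots, \nu_t)$.

Proof. We prove this for the case $p \le m$ and $\nu_p > 0$. Other cases can be
proved similarly. By Theorem A.2.2 of the Appendix there is a matrix $B$ such
that

(47)   $B \Sigma B' = I, \qquad B \Xi \Xi' B' = D_{\nu}.$

Let

(48)   $F_1 = D_{\nu}^{-1/2} B \Xi \qquad (p \times m).$

Then

(49)   $F_1 F_1' = I_p.$

Let $F' = (F_1', F_2')$ be a full $m \times m$ orthogonal matrix. Then

(50)   $B \Xi F_2' = 0$

and

(51)   $B \Xi F' = B \Xi (F_1', F_2') = B \Xi (\Xi' B' D_{\nu}^{-1/2}, F_2') = (D_{\nu}^{1/2}, 0).$ ■

Now let

(52)   $U = B X F', \qquad V = B Z.$

Then the columns of $U$, $V$ are independently normally distributed with
covariance matrix $I$ and means when $p \le m$

(53)   $\mathscr{E} U = (D_{\nu}^{1/2}, 0),$
       $\mathscr{E} V = 0.$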
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Invariant tests are given in terms of characteristic roots > l t ) 

of U\W f )~ x U, Note that for the admissibility we used the characteristic 
roots of A, of U f {UU r + Vy r )~ l U rathsr than /, = A f /(1 - \ { X Here it is more 
natural to use l h which corresponds to the parameter value v v The following 
theorem is given by Das Gupta ， Anderson, and Mudholkar (1964). 

Theorem 8.10.6. If the acceptance region of an invariant test is convex in 
the space of each column vector of V for each of fixed values of V and of the 
other column vectors of U，then the power of the test increases monotomcally in 
each v { . 

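Proof. Since $UU'$ is unchanged when any column vector of $U$ is multiplied
by $-1$, the acceptance region is symmetric about the origin in each of the
column vectors of $U$. Now the density of $U = (u_{ij})$, $V = (v_{ij})$ is

$$f(U, V) = (2\pi)^{-\frac{1}{2}(n+m)p} \exp\left[-\tfrac{1}{2}\left\{\operatorname{tr} VV' + \sum_{i=1}^{t}\left(u_{ii} - \sqrt{\nu_i}\right)^2 + \sum_{i \ne j} u_{ij}^2\right\}\right]. \tag{54}$$

Applying Theorem 8.10.5 to (54), we see that the power increases monotonically
in each $\sqrt{\nu_i}$, and hence in each $\nu_i$. ■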

Since the section of a convex set is convex, we have the following corollary.

Corollary 8.10.3. If the acceptance region $A$ of an invariant test is convex in
$U$ for each fixed $V$, then the power of the test increases monotonically in each $\nu_i$.

From this we see that Roy's maximum root test $A$: $l_1 \le K$ and the
Lawley-Hotelling trace test $A$: $\operatorname{tr} U'(VV')^{-1}U \le K$ have power functions that
are monotonically increasing in each $\nu_i$.

To see that the acceptance region of the likelihood ratio test

$$A:\ \prod_{i=1}^{t}(1 + l_i) \le K \tag{55}$$

satisfies the condition of Theorem 8.10.6 let

$$(VV')^{-1} = T'T, \quad T: p \times p, \qquad U^* = (u_1^*, \dots, u_m^*) = TU. \tag{56}$$

Then

$$\prod_{i=1}^{t}(1 + l_i) = |U'(VV')^{-1}U + I| = |U^{*\prime}U^* + I| = |U^*U^{*\prime} + I| = |u_1^*u_1^{*\prime} + B| = (u_1^{*\prime}B^{-1}u_1^* + 1)|B| = (u_1'T'B^{-1}Tu_1 + 1)|B|, \tag{57}$$

where $B = u_2^*u_2^{*\prime} + \cdots + u_m^*u_m^{*\prime} + I$. Since $T'B^{-1}T$ is positive definite, (55) is
convex in $u_1$. Therefore, the likelihood ratio test has a power function which
is monotone increasing in each $\nu_i$.
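The chain of equalities in (57) is easy to confirm on random matrices. The sketch below assumes NumPy; `lr_product_two_ways` is a hypothetical helper, and a Cholesky factor is used as one convenient choice of $T$ with $T'T = (VV')^{-1}$.

```python
import numpy as np

def lr_product_two_ways(U, V):
    """Compute prod(1 + l_i) from the roots of U'(VV')^{-1}U and from the
    last expression in (57); U is p x m, V is p x n with n >= p."""
    p = U.shape[0]
    W_inv = np.linalg.inv(V @ V.T)
    roots = np.linalg.eigvalsh(U.T @ W_inv @ U)     # l_1, ..., l_m (extra roots are 0)
    lhs = np.prod(1.0 + roots)
    T = np.linalg.cholesky(W_inv).T                 # W_inv = C C', so T = C' gives T'T = W_inv
    Ustar = T @ U
    Bmat = np.eye(p) + Ustar[:, 1:] @ Ustar[:, 1:].T  # B = u_2* u_2*' + ... + u_m* u_m*' + I
    u1 = U[:, 0]
    quad = u1 @ T.T @ np.linalg.inv(Bmat) @ T @ u1    # u_1' T' B^{-1} T u_1
    rhs = (quad + 1.0) * np.linalg.det(Bmat)
    return lhs, rhs
```

Because the right-hand side is a positive definite quadratic form in $u_1$ plus a constant, the set where it is at most $K$ is an ellipsoid in $u_1$, which is the convexity used in the text.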
The Bartlett-Nanda-Pillai trace test

$$A:\ \operatorname{tr} U'(UU' + VV')^{-1}U = \sum_{i=1}^{t}\frac{l_i}{1 + l_i} \le K \tag{58}$$

has an acceptance region that is an ellipsoid if $K < 1$ and is convex in each
column of $U$ provided $K \le 1$. (See Problem 8.36.) For $K > 1$ (58) may not
be convex in each column of $U$. The reader can work out an example for
$p = 2$.

Eaton and Perlman (1974) have shown that if an invariant test is convex in
$U$ and $W = VV'$, then the power at $(\nu_1^0, \dots, \nu_t^0)$ is greater than at $(\nu_1, \dots, \nu_t)$ if
$(\sqrt{\nu_1}, \dots, \sqrt{\nu_t}) \prec_w (\sqrt{\nu_1^0}, \dots, \sqrt{\nu_t^0})$. We shall not prove this result. Roy's
maximum root test and the Lawley-Hotelling trace test satisfy the condition,
but the likelihood ratio and the Bartlett-Nanda-Pillai trace test do not.

Takemura has shown that if the acceptance region is convex in $U$ and $W$,
the set of $\sqrt{\nu_1}, \dots, \sqrt{\nu_t}$ for which the power is not greater than a constant is
monotone and convex.

It is enlightening to consider the contours of the power function.
Theorem 8.10.6 does not exclude case (a) of Figure 8.8,
and similarly the Eaton-Perlman result does not exclude (b). The last result
guarantees that the contour looks like (c) for Roy's maximum root test and
the Lawley-Hotelling trace test. These results relate to the fact that these
two tests are more likely to detect alternative hypotheses where few $\nu_i$'s are
far from zero. In contrast with this, the likelihood ratio test and the
Bartlett-Nanda-Pillai trace test are sensitive to the overall departure from
the null hypothesis. It might be noted that the convexity in $\sqrt{\nu}$-space cannot
be translated into the convexity in $\nu$-space.

[Figure 8.8. Contours of power functions; panels (a), (b), (c).]

By using the noncentral density of the $l_i$'s, which depends on the parameter
values $\nu_1, \dots, \nu_t$, Perlman and Olkin (1980) showed that any invariant test
with monotone acceptance region (in the space of roots) is unbiased. Note
that this result covers all the standard tests considered earlier.


8.11. ELLIPTICALLY CONTOURED DISTRIBUTIONS 

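8.11.1. Observations Elliptically Contoured

The regression model of Section 8.2 can be written

$$x_\alpha = \beta z_\alpha + e_\alpha, \qquad \alpha = 1, \dots, N, \tag{1}$$

where $e_\alpha$ is an unobserved disturbance with $\mathscr{E}e_\alpha = 0$ and $\mathscr{E}e_\alpha e_\alpha' = \Sigma$. We
assume that $e_\alpha$ has a density $|\Lambda|^{-1/2}g(e'\Lambda^{-1}e)$; then $\Sigma = (\mathscr{E}R^2/p)\Lambda$, where
$R^2 = e_\alpha'\Lambda^{-1}e_\alpha$. In general the exact distribution of $B = \sum_{\alpha=1}^{N}x_\alpha z_\alpha'A^{-1}$ and
$N\hat{\Sigma} = \sum_{\alpha=1}^{N}(x_\alpha - Bz_\alpha)(x_\alpha - Bz_\alpha)'$ is difficult to obtain and cannot be
expressed concisely. However, the expected value of $B$ is $\beta$, and the covariance
matrix of $\operatorname{vec} B$ is $A^{-1} \otimes \Sigma$ with $A = \sum_{\alpha=1}^{N}z_\alpha z_\alpha'$. We can develop a
large-sample distribution for $B$ and $N\hat{\Sigma}$.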

Theorem 8.11.1. Suppose $(1/N)A \to A_0$, $z_\alpha'z_\alpha <$ constant, $\alpha = 1, 2, \dots$,
and either the $e_\alpha$'s are independent identically distributed or the $e_\alpha$'s are
independent with $\mathscr{E}|e_\alpha'e_\alpha|^{2+\varepsilon} <$ constant for some $\varepsilon > 0$. Then $B \xrightarrow{p} \beta$ and
$\sqrt{N}\operatorname{vec}(B - \beta)$ has a limiting normal distribution with mean 0 and covariance matrix
$A_0^{-1} \otimes \Sigma$.

Theorem 8.11.1 appears in Anderson (1971) as Theorem 5.5.13. There are
many alternatives to its assumptions in the literature. Under its assumptions
$\hat{\Sigma} \xrightarrow{p} \Sigma$. This result permits a large-sample theory for the criteria for testing
null hypotheses about $\beta$.

Consider testing the null hypothesis

$$H:\ \beta = \beta^*, \tag{2}$$

where $\beta^*$ is completely specified. In Section 8.3 a more general hypothesis
was considered for $\beta$ partitioned as $\beta = (\beta_1, \beta_2)$. However, as shown in that
section by the transformation (4), the hypothesis $\beta_1 = \beta_1^*$ can be reduced to a
hypothesis of the form (2) above.

Let

$$G = \sum_{\alpha=1}^{N}(x_\alpha - Bz_\alpha)(x_\alpha - Bz_\alpha)' = N\hat{\Sigma}, \tag{3}$$

$$H = (B - \beta)A(B - \beta)'. \tag{4}$$

Lemma 8.11.1. Under the conditions of Theorem 8.11.1 the limiting
distribution of $H$ is $W(\Sigma, q)$.

Proof. Write $H$ as

$$H = \sqrt{N}(B - \beta)\,\tfrac{1}{N}A\,\sqrt{N}(B - \beta)'. \tag{5}$$

Then the lemma follows from Theorem 8.11.1 and (4) of Section 8.4. ■

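We can express the likelihood ratio criterion in the form

$$-2\log\lambda = -N\log U = N\log|I + G^{-1}H| = N\log\left|I + \frac{1}{N}\left(\frac{1}{N}G\right)^{-1}H\right|. \tag{6}$$

Theorem 8.11.2. Under the conditions of Theorem 8.11.1, when the null
hypothesis is true,

$$-2\log\lambda \xrightarrow{d} \chi^2_{pq}. \tag{7}$$

Proof. We use the fact that $N\log|I + N^{-1}C| = \operatorname{tr} C + O_p(N^{-1})$ when $N \to \infty$,
since $|I + xC| = 1 + x\operatorname{tr} C + O(x^2)$ (Theorem A.4.8).

We have

$$\operatorname{tr}\left(\tfrac{1}{N}G\right)^{-1}H = N\sum_{i,j=1}^{p}\sum_{g,h=1}^{q}g^{ij}(b_{ig} - \beta_{ig})a_{gh}(b_{jh} - \beta_{jh}) = [\operatorname{vec}(B' - \beta')]'\left[\left(\tfrac{1}{N}G\right)^{-1} \otimes A\right]\operatorname{vec}(B' - \beta') \xrightarrow{d} \chi^2_{pq}, \tag{8}$$

where $(g^{ij}) = G^{-1}$, because $(1/N)G \xrightarrow{p} \Sigma$, $(1/N)A \to A_0$, and the limiting distribution of
$\sqrt{N}\operatorname{vec}(B' - \beta')$ is $N(0, \Sigma \otimes A_0^{-1})$. ■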
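The quantities entering (6) can be computed directly from data. The following is a minimal sketch, assuming NumPy and the layout of this chapter ($X$ is $p \times N$, $Z$ is $q \times N$); `lr_statistic` is a hypothetical name, not a routine from the text.

```python
import numpy as np

def lr_statistic(X, Z, beta0):
    """Return (-2 log lambda, G, H) for H: beta = beta0, using
    -2 log lambda = N log|I + G^{-1} H| as in (6)."""
    p, N = X.shape
    A = Z @ Z.T
    B = X @ Z.T @ np.linalg.inv(A)          # least squares estimator
    R = X - B @ Z                           # residuals
    G = R @ R.T
    H = (B - beta0) @ A @ (B - beta0).T
    _, logdet = np.linalg.slogdet(np.eye(p) + np.linalg.solve(G, H))
    return N * logdet, G, H
```

Under the conditions of Theorem 8.11.1 the returned statistic would be referred to the $\chi^2$-distribution with $pq$ degrees of freedom; obtaining the percentage point is left to whatever table or library the reader prefers.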
Theorem 8.11.2 agrees with the first term of the asymptotic expansion of
$-2\log\lambda$ given by Theorem 8.5.2 for sampling from a normal distribution.
The test and confidence procedures discussed in Sections 8.3 and 8.4 can be
applied using this $\chi^2$-distribution.

The criterion $U = \lambda^{2/N}$ can be written as $U = \prod_{i=1}^{p}V_i$, where $V_i$ is defined
in (8) of Section 8.4. The term $V_i$ has the form of $U$; that is, it is the ratio of
the sum of squares of residuals of $x_{i\alpha}$ regressed on $x_{1\alpha}, \dots, x_{i-1,\alpha}, z_\alpha$ to the
sum regressed on $x_{1\alpha}, \dots, x_{i-1,\alpha}$. It follows that under the null hypothesis
$V_1, \dots, V_p$ are asymptotically independent and $-N\log V_i \xrightarrow{d} \chi^2_q$. Thus
$-N\log U = -N\sum_{i=1}^{p}\log V_i \xrightarrow{d} \chi^2_{pq}$. This argument justifies the step-down
procedure asymptotically.

Section 8.6 gave several other criteria for the general linear hypothesis:
the Lawley-Hotelling trace $\operatorname{tr} HG^{-1}$, the Bartlett-Nanda-Pillai trace $\operatorname{tr} H(G + H)^{-1}$,
and the Roy maximum root of $HG^{-1}$ or $H(G + H)^{-1}$. The limiting
distributions of $N\operatorname{tr} HG^{-1}$ and $N\operatorname{tr} H(G + H)^{-1}$ are again $\chi^2_{pq}$. The limiting
distribution of the maximum characteristic root of $NHG^{-1}$ or $NH(G + H)^{-1}$
is the distribution of the maximum characteristic root of $H$ having the
distribution $W(I, q)$ (Lemma 8.11.1). Significance points for these test criteria
are available in Appendix B.
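All four criteria are functions of the characteristic roots of $HG^{-1}$, which suggests computing the roots once. The sketch below assumes NumPy; the function name and dictionary keys are hypothetical labels chosen for illustration.

```python
import numpy as np

def invariant_criteria(H, G):
    """Compute U, the two trace criteria and Roy's maximum root from the
    characteristic roots l_i of H G^{-1} (G positive definite, H p.s.d.)."""
    l = np.sort(np.linalg.eigvals(np.linalg.solve(G, H)).real)[::-1]
    l = np.clip(l, 0.0, None)               # remove tiny negative rounding errors
    return {
        "U": float(np.prod(1.0 / (1.0 + l))),               # |G| / |G + H|
        "lawley_hotelling": float(np.sum(l)),                # tr H G^{-1}
        "bartlett_nanda_pillai": float(np.sum(l / (1.0 + l))),  # tr H (G + H)^{-1}
        "roy_max_root": float(l[0]),
    }
```

Multiplying the two traces by $N$ gives the statistics whose limiting distribution is stated above.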


8.11.2. Elliptically Contoured Matrix Distributions

In Section 8.3.2 the $p \times N$ matrix of observations on the dependent variable
was defined as $X = (x_1, \dots, x_N)$, and the $q \times N$ matrix of observations on the
independent variables as $Z = (z_1, \dots, z_N)$; the two matrices are related by
$\mathscr{E}X = \beta Z$. Note that in this chapter the matrices of observations have $N$
columns instead of $N$ rows.

Let $E = (e_1, \dots, e_N)$ be a $p \times N$ random matrix with density
$|\Lambda|^{-N/2}g[F^{-1}EE'(F')^{-1}]$, where $\Lambda = FF'$. Define $X$ by

$$X = \beta Z + E. \tag{9}$$

In these terms the least squares estimator of $\beta$ is

$$B = XZ'(ZZ')^{-1} = CA^{-1}, \tag{10}$$

where $C = XZ' = \sum_{\alpha=1}^{N}x_\alpha z_\alpha'$ and $A = ZZ' = \sum_{\alpha=1}^{N}z_\alpha z_\alpha'$. Note that the density
of $E$ is invariant with respect to multiplication on the right by $N \times N$
orthogonal matrices; that is, $E'$ is left spherical. Then $E'$ has the stochastic
representation

$$E' \stackrel{d}{=} UTF', \tag{11}$$

where $U$ has the uniform distribution on $U'U = I_p$, $T$ is the lower triangular
matrix with nonnegative diagonal elements satisfying $EE' = TT'$, and $F$
is a lower triangular matrix with nonnegative diagonal elements satisfying
$FF' = \Sigma$. We can write

$$B - \beta = EZ'A^{-1} \stackrel{d}{=} FT'U'Z'A^{-1}, \tag{12}$$

$$H = (B - \beta)A(B - \beta)' = EZ'A^{-1}ZE' \stackrel{d}{=} FT'U'(Z'A^{-1}Z)UTF', \tag{13}$$

$$G = (X - \beta Z)(X - \beta Z)' - H = EE' - H = E(I_N - Z'A^{-1}Z)E' \stackrel{d}{=} FT'U'(I_N - Z'A^{-1}Z)UTF'. \tag{14}$$
It was shown in Section 8.6 that the likelihood ratio criterion for $H$: $\beta = 0$,
the Lawley-Hotelling trace criterion, the Bartlett-Nanda-Pillai trace criterion,
and the Roy maximum root test are invariant with respect to linear
transformations $x \to Kx$. Then Corollary 4.5.5 implies the following theorem.

Theorem 8.11.3. Under the null hypothesis $\beta = 0$, the distribution of each
invariant criterion when the distribution of $E'$ is left spherical is the same as the
distribution under normality.

Thus the tests and confidence regions described in Section 8.7 are valid
for left-spherical distributions $E'$.
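The invariance that drives Theorem 8.11.3 can be illustrated numerically: under $\beta = 0$ the roots of $HG^{-1}$ are unchanged when $E$ is replaced by $KE$ for nonsingular $K$. The sketch below assumes NumPy; `null_roots` is a hypothetical helper built from (13) and (14).

```python
import numpy as np

def null_roots(E, Z):
    """Roots of G^{-1} H under beta = 0, with H = E P E' and
    G = E (I - P) E', where P = Z'A^{-1}Z is the projection in (13)-(14)."""
    N = Z.shape[1]
    P = Z.T @ np.linalg.solve(Z @ Z.T, Z)       # idempotent, rank q
    H = E @ P @ E.T
    G = E @ (np.eye(N) - P) @ E.T
    roots = np.sort(np.linalg.eigvals(np.linalg.solve(G, H)).real)[::-1]
    return roots, P
```

Since every criterion of Section 8.6 is a function of these roots, the factor $FT'$ in the stochastic representation cancels, leaving dependence on $U$ alone.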

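The matrices $Z'A^{-1}Z$ and $I_N - Z'A^{-1}Z$ are idempotent of ranks $q$ and
$N - q$. There is an orthogonal matrix $O$ of order $N$ such that

$$OZ'A^{-1}ZO' = \begin{pmatrix} I_q & 0 \\ 0 & 0 \end{pmatrix}, \qquad O(I_N - Z'A^{-1}Z)O' = \begin{pmatrix} 0 & 0 \\ 0 & I_{N-q} \end{pmatrix}. \tag{15}$$

The transformation $V = OU$ is uniformly distributed on $V'V = I_p$, and

$$H \stackrel{d}{=} KV'\begin{pmatrix} I_q & 0 \\ 0 & 0 \end{pmatrix}VK', \qquad G \stackrel{d}{=} KV'\begin{pmatrix} 0 & 0 \\ 0 & I_{N-q} \end{pmatrix}VK', \tag{16}$$

where $K = FT'$.

The trace criterion $\operatorname{tr} HG^{-1}$, for example, is

$$\operatorname{tr} HG^{-1} \stackrel{d}{=} \operatorname{tr}\left\{V'\begin{pmatrix} I_q & 0 \\ 0 & 0 \end{pmatrix}V\left[V'\begin{pmatrix} 0 & 0 \\ 0 & I_{N-q} \end{pmatrix}V\right]^{-1}\right\}. \tag{17}$$

The distribution of any invariant criterion depends only on $U$ (or $V$), not
on $T$.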
Since $G + H = FT'TF'$, it is independent of $U$. A selection of a linear
transformation of $X$ can be made on the basis of $G + H$. Let $D$ be a $p \times r$
matrix of rank $r$ that may depend on $G + H$. Define $x_\alpha^* = D'x_\alpha$. Then
$\mathscr{E}x_\alpha^* = D'\beta z_\alpha$, and the hypothesis $\beta = 0$ implies $D'\beta = 0$. Let $X^* =
(x_1^*, \dots, x_N^*) = D'X$, $E_D = D'E$, $H_D = D'HD$, $G_D = D'GD$. Then
$E_D' = E'D \stackrel{d}{=} UTF'D$. The invariant test criteria for $D'\beta = 0$ are those for
$\beta = 0$ and have the same distributions under the null hypothesis as for the
normal distribution with $p$ replaced by $r$.


PROBLEMS 

8.1. (Sec. 8.2.2) Consider the following sample (for $N = 8$):

Weight of grain         40  17   9  15   6  12   5   9
Weight of straw         53  19  10  29  13  27  19  30
Amount of fertilizer    24  11   5  12   7  14  11  18

Let $z_{2\alpha} = 1$, and let $z_{1\alpha}$ be the amount of fertilizer on the $\alpha$th plot. Estimate $\beta$
for this sample. Test the hypothesis $\beta_1 = 0$ at the 0.01 significance level.

8.2. (Sec. 8.2) Show that Theorem 3.2.1 is a special case of Theorem 8.2.1.
[Hint: Let $q = 1$, $z_\alpha = 1$, $\beta = \mu$.]

8.3. (Sec. 8.2) Prove Theorem 8.2.3.

8.4. (Sec. 8.2) Show that $\hat{\beta}$ minimizes the generalized variance

$$\left|\sum_{\alpha=1}^{N}(x_\alpha - \beta z_\alpha)(x_\alpha - \beta z_\alpha)'\right|.$$

8.5. (Sec. 8.3) In the following data [Woltz, Reid, and Colwell (1948), used by
R. L. Anderson and Bancroft (1952)] the variables are $x_1$, rate of cigarette burn;
$x_2$, the percentage of nicotine; $z_1$, the percentage of nitrogen; $z_2$, of chlorine;
$z_3$, of potassium; $z_4$, of phosphorus; $z_5$, of calcium; and $z_6$, of magnesium; and
$z_7 = 1$; and $N = 25$:

$$\sum_{\alpha=1}^{N}x_\alpha = \begin{pmatrix} 42.20 \\ 54.03 \end{pmatrix}, \qquad \sum_{\alpha=1}^{N}z_\alpha = \begin{pmatrix} 53.92 \\ 62.02 \\ 56.00 \\ 12.25 \\ 89.79 \\ 24.10 \\ 25 \end{pmatrix},$$

$$\sum_{\alpha=1}^{N}(x_\alpha - \bar{x})(x_\alpha - \bar{x})' = \begin{pmatrix} 0.6690 & 0.4527 \\ 0.4527 & 6.5921 \end{pmatrix},$$

$$\sum_{\alpha=1}^{N}(z_\alpha - \bar{z})(z_\alpha - \bar{z})' = \begin{pmatrix}
1.8311 & -0.3589 & -0.0125 & -0.0244 & 1.6379 & 0.5057 & 0 \\
-0.3589 & 8.8102 & -0.3469 & 0.0352 & 0.7920 & 0.2173 & 0 \\
-0.0125 & -0.3469 & 1.5818 & -0.0415 & -1.4278 & -0.4753 & 0 \\
-0.0244 & 0.0352 & -0.0415 & 0.0258 & 0.0043 & 0.0154 & 0 \\
1.6379 & 0.7920 & -1.4278 & 0.0043 & 3.7248 & 0.9120 & 0 \\
0.5057 & 0.2173 & -0.4753 & 0.0154 & 0.9120 & 0.3828 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix},$$

$$\sum_{\alpha=1}^{N}(z_\alpha - \bar{z})(x_\alpha - \bar{x})' = \begin{pmatrix}
0.2501 & 2.6691 \\
-1.5136 & -2.0617 \\
0.5007 & -0.9503 \\
-0.0421 & -0.0187 \\
-0.1914 & 3.4020 \\
-0.1586 & 1.1663 \\
0 & 0
\end{pmatrix}.$$

(a) Estimate the regression of $x_1$ and $x_2$ on $z_1$, $z_5$, $z_6$, and $z_7$.

(b) Estimate the regression on all seven variables.

(c) Test the hypothesis that the regression on $z_2$, $z_3$, and $z_4$ is 0.

8.6. (Sec. 8.3) Let $q = 2$, $z_{1\alpha} = w_\alpha$ (scalar), $z_{2\alpha} = 1$. Show that the $U$-statistic for
testing the hypothesis $\beta_1 = 0$ is a monotonic function of a $T^2$-statistic, and give
the $T^2$-statistic in a simple form. (See Problem 5.1.)

8.7. (Sec. 8.3) Let $z_{q\alpha} = 1$, let $q_2 = 1$, and let

A* = L(2, 0 -2/)(^„-2;)|. 1.?1 =<?- 1- 


Prove that 

(Pm ~ Pi)(^n ^- 4 i2- 4 22 1 ^ 4 2i)(Pin ^ Pi)' ^ (Pia - P!)’ ( 自 in — Pi) r . 


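8.8. (Sec. 8.3) Let $q_1 = q_2$. How do you test the hypothesis $\beta_1 = \beta_2$?

8.9. (Sec. 8.3) Prove

$$\hat{\beta}_{1\Omega} = \sum_\alpha x_\alpha\left(z_\alpha^{(1)} - A_{12}A_{22}^{-1}z_\alpha^{(2)}\right)'\left[\sum_\alpha\left(z_\alpha^{(1)} - A_{12}A_{22}^{-1}z_\alpha^{(2)}\right)\left(z_\alpha^{(1)} - A_{12}A_{22}^{-1}z_\alpha^{(2)}\right)'\right]^{-1} = \left(C_1 - C_2A_{22}^{-1}A_{21}\right)\left(A_{11} - A_{12}A_{22}^{-1}A_{21}\right)^{-1}.$$

8.10. (Sec. 8.4) By comparing Theorem 8.2.2 and Problem 8.9, prove Lemma 8.4.1.

8.11. (Sec. 8.4) Prove Lemma 8.4.1 by showing that the density of $\hat{\beta}_{1\Omega}$ and $\hat{\beta}_{2\omega}$ is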

A exp[-|tr 2- l (p in - P?)^n. 2 (Pin- Pt)1 

^2 exp [ — 士 tr 2 自 2 故—？ 2 )乂 22 (自 2 似一 P2 ) ] * 


8.12, (Sec. 8.4) Show that the cdf of U 3t3 n is 

a ^_ U) + ICZL±Mi(^ 

•( 中 - 1 ) + "T^rr [勵 n(2«_l)-r] 

2t>, ( 1 + /T^u 2u^->(l-u)^ 

+ — log Hr- r 3 ( „Vi), 


[Hint ； Use Theorem 8,4.4. The region {0 < < 1,0 <z 2 ^ 1> ^z 2 <u} is the 

union of {O^z, ^ 1,0 <,z 2 ^ w} and {0 <,z ] ^ u/z 2J u s 1}.] 


8.13. (Sec. 8.4) Find $\Pr\{U_{4,3,n} \ge u\}$.

8.14. (Sec. 8.4) Find $\Pr\{U_{4,4,n} \ge u\}$.

8.15. (Sec. 8.4) For $p \le m$ find $\mathscr{E}U^h$ from the density of $G$ and $H$. [Hint: Use the
fact that the density of $K + \sum_{i=1}^{t}V_iV_i'$ is $W(\Sigma, s + t)$ if the density of $K$ is
$W(\Sigma, s)$ and $V_1, \dots, V_t$ are independently distributed as $N(0, \Sigma)$.]


8.16. (Sec. 8.4)

(a) Show that when $p$ is even, the characteristic function of $Y = \log U_{p,m,n}$, say
$\phi(t) = \mathscr{E}e^{itY}$, is the reciprocal of a polynomial.

(b) Sketch a method of inverting the characteristic function of $Y$ by the
method of residues.

(c) Show that the resulting density of $U$ is a polynomial in $\sqrt{u}$ and $\log u$ with
possibly a factor of $u^{-1}$.

8.17. (Sec. 8.5) Use the asymptotic expansion of the distribution to compute
$\Pr\{-k\log U_{3,3,n} \le M^*\}$ for

(a) $n = 8$, $M^* = 14.7$,

(b) $n = 8$, $M^* = 21.7$,

(c) $n = 16$, $M^* = 14.7$,

(d) $n = 16$, $M^* = 21.7$.

(Either compute to the third decimal place or use the expansion to the $k^{-4}$
term.)

8.18. (Sec. 8.5) In case $p = 3$, $q_1 = 4$, and $n = N - q = 20$, find the 50% significance
point for $k\log U$ (a) using $-2\log\lambda$ as $\chi^2$ and (b) using $-k\log U$ as $\chi^2$. Using
more terms of this expansion, evaluate the exact significance levels for your
answers to (a) and (b).

8.19. (Sec. 8.6.5) Prove for $l_i \ge 0$, $i = 1, \dots, p$,

$$\sum_{i=1}^{p}\frac{l_i}{1 + l_i} \le \log\prod_{i=1}^{p}(1 + l_i) \le \sum_{i=1}^{p}l_i.$$

Comment: The inequalities imply an ordering of the values of the
Bartlett-Nanda-Pillai trace, the negative logarithm of the likelihood ratio
criterion, and the Lawley-Hotelling trace.

8.20. (Sec. 8.6) The multivariate beta density. Let $H$ and $G$ be independently
distributed according to $W(\Sigma, m)$ and $W(\Sigma, n)$, respectively. Let $C$ be a matrix
such that $CC' = H + G$, and let

$$L = C^{-1}H(C')^{-1}.$$

Show that the density of $L$ is

$$\frac{\Gamma_p[\tfrac{1}{2}(m + n)]}{\Gamma_p(\tfrac{1}{2}m)\,\Gamma_p(\tfrac{1}{2}n)}\,|L|^{\frac{1}{2}(m-p-1)}\,|I - L|^{\frac{1}{2}(n-p-1)}$$

for $L$ and $I - L$ positive definite, and 0 otherwise.

8.21. (Sec. 8.9) Let $Y_{ij}$ (a $p$-component vector) be distributed according to $N(\mu_{ij}, \Sigma)$,
where $\mathscr{E}Y_{ij} = \mu_{ij} = \mu + \lambda_i + \nu_j + \gamma_{ij}$, $\sum_i\lambda_i = 0 = \sum_j\nu_j = \sum_i\gamma_{ij} = \sum_j\gamma_{ij}$; the $\gamma_{ij}$
are the interactions. If $m$ observations are made on each $Y_{ij}$ (say $y_{ij1}, \dots, y_{ijm}$),
how do you test the hypothesis $\lambda_i = 0$, $i = 1, \dots, r$? How do you test the
hypothesis $\gamma_{ij} = 0$, $i = 1, \dots, r$, $j = 1, \dots, c$?

8.22. (Sec. 8.9) The Latin square. Let $Y_{ij}$, $i, j = 1, \dots, r$, be distributed according to
$N(\mu_{ij}, \Sigma)$, where $\mathscr{E}Y_{ij} = \mu_{ij} = \gamma + \lambda_i + \nu_j + \mu_k$ and $k = j - i + 1 \pmod r$ with
$\sum\lambda_i = \sum\nu_j = \sum\mu_k = 0$.

(a) Give the univariate analysis of variance table for main effects and error
(including sums of squares, numbers of degrees of freedom, and mean
squares).

(b) Give the table for the vector case.

(c) Indicate in the vector case how to test the hypothesis $\lambda_i = 0$, $i = 1, \dots, r$.

8.23. (Sec. 8.9) Let $x_1$ be the yield of a process and $x_2$ a quality measure. Let
$z_1 = 1$, $z_2 = \pm 10°$ (temperature relative to average), $z_3 = \pm 0.75$ (relative
measure of flow of one agent), and $z_4 = \pm 1.50$ (relative measure of flow of another
agent). [See Anderson (1955a) for details.] Three observations were made on $x_1$
and $x_2$ for each possible triplet of values of $z_2$, $z_3$, and $z_4$. The estimate of $\beta$ is

$$\hat{\beta} = \begin{pmatrix} 58.529 & -0.3829 & -5.050 & 2.308 \\ 98.675 & 0.1558 & 4.144 & -0.700 \end{pmatrix};$$

$s_1 = 3.090$, $s_2 = 1.619$, and $r = -0.6632$ can be used to compute $S$ or $\hat{\Sigma}$.

(a) Formulate an analysis of variance model for this situation.

(b) Find a confidence region for the effects of temperature (i.e., $\beta_{12}$, $\beta_{22}$).

(c) Test the hypothesis that the two agents have no effect on the yield and
quantity.

8.24. (Sec. 8.6) Interpret the transformations referred to in Theorem 8.6.1 in the
original terms; that is, $H$: $\beta_1 = \beta_1^*$ and $z_\alpha^{(1)}$.

8.25. (Sec. 8.6) Find the cdf of $\operatorname{tr} HG^{-1}$ for $p = 2$. [Hint: Use the distribution of the
roots given in Chapter 13.]

8.26. (Sec. 8,10.1) Bartlett-Nanda -Pillai V-test as a Bayes procedure• Let 

be independently normally distributed with covariance matrix 
i and means ^Ew t 免 y n i = 1 ， ”. ， m, cc Ew t = 0, / 4- +«. Let n 0 be 

defined by [T,, 2] = [0,(/+ CC)' *], where the pXm matrix C has a density 
proportional to |/4 - CC f \ ^ 7^ m \ and ( 7 P . WJ 7 m >, let be defined by 
[T lfc X] = [( / + CC f )^ l C,(/ + CC0 -1 ] where C has a density proportional to 

l/ + crr^ /I+fl,) e^ lrC，(UCC，r，r . — 

(a) Show that the measures are finite for n>p by showing tr C\I + CC r )^ [ C 
<m and verifying that the integral of | / + CC f \ "is finite. [Hint: Let 
C= (cj,.,.,c m ), Dj = I ^ L ; tm[ c { c f t =E } Ej, c 广 Ej-'d” j= l,...,m(£ 0 = /). 
Show \Dj\ = iZJ^jKl +rf^rf ; ) and hence \D m \ - n^L^l +«)■ Then refei 
to Problem 5.15J 

(b) Show that the inequality (26) of Section 5.6 is equivalent to 



Hence the Bartlett-Nanda-Pillai V-test is Bayes and thus admissible.

8,27. (Sec. 8.10,1) Likelihood ratio lest as a Bayes procedure. Let be 

independently normally distributed with covariance matrix X and means oo£h^ 
= 7 n i = = 0 ， / = m + with n^m +p. Let n 0 be 

defined by [r M S] = [0,( / + CCO" l ], where the pXm matrix C has a density 
proportional to |/+ CC l \ ~ 、 and I"! = let rii be defined by 

[r 1 ,x] = [(/ + cc , )-'cD,(/ + cc-)- 1 ], 
where the $m$ columns of $D$ are conditionally independently normally distributed
with means 0 and covariance matrix [I— C r (I + CC0~ l C] _l , and C has (margi¬ 
nal) density proportional to 

(a) Show the measures are finite, [Hint: See Problem 8,26,] 

(b) Show that the inequality (26) of Section 5.6 is equivalent to 

IO;l : 

一 " ZT7T 

. w i w (\ 

Hence the likelihood ratio test is Bayes and thus admissible. 

8.28. (Sec. 8.10.1) Admissibility of the likelihood ratio test. Show that the acceptance
region $|ZZ'|/|ZZ' + XX'| \ge c$ satisfies the conditions of Theorem 8.10.1. [Hint:
The acceptance region can be written $\prod_{i=1}^{t}m_i \ge c$, where $m_i = 1 - \lambda_i$,
$i = 1, \dots, t$.]

8.29. (Sec. 8.10.1) Admissibility of the Lawley-Hotelling test. Show that the
acceptance region $\operatorname{tr} XX'(ZZ')^{-1} \le c$ satisfies the conditions of Theorem 8.10.1.

8.30. (Sec. 8.10.1) Admissibility of the Bartlett-Nanda-Pillai trace test. Show that the
acceptance region $\operatorname{tr} X'(ZZ' + XX')^{-1}X \le c$ satisfies the conditions of Theorem
8.10.1.

8.31. (Sec. 8.10.1) Show that if $A$ and $B$ are positive definite and $A - B$ is positive
semidefinite, then $B^{-1} - A^{-1}$ is positive semidefinite.

8.32. (Sec. 8.10.1) Show that the boundary of $A$ has $m$-measure 0. [Hint: Show that
(closure of $A$) $\subset A \cup C$, where $C = \{V \mid U - YY'$ is singular$\}$.]

8J3. (Sec. 8.10.1) Show that if AcR^ is convex and monotone in majorization, 
then A* is convex. [Mine Show 

{px+qy)i> w px t +qy iy 

where 

Z 丄 =( % 卜 … ， ^m])' G ^ < -] 

8*34, (Sec. 8*10.1) Show that C(X) is convex. [Hint: Follow the solution of Problem 
8.33 to show (px ~^qy)< w \ if x< w \ and y -< w \.] 

8*35. (Sec, 8.10*1) Show that if A is monotone, then is monotone* [Hint: Use 
the fact that 

^[<c]= max {min(^ i ,...,^ )i )}.] 
8.36. (Sec. 8.10.2) Monotonicity of the power function of the Bartlett-Nanda-Pillai
trace test. Show that

$$\operatorname{tr}(uu' + B)(uu' + B + W)^{-1} \le K$$

is convex in $u$ for fixed positive semidefinite $B$ and positive definite $B + W$ if
$0 \le K \le 1$. [Hint: Verify

$$(uu' + B + W)^{-1} = (B + W)^{-1} - \frac{1}{1 + u'(B + W)^{-1}u}(B + W)^{-1}uu'(B + W)^{-1}.$$

The resulting quadratic form in $u$ involves the matrix $(\operatorname{tr} A)I - A$ for $A =
(B + W)^{-1/2}B(B + W)^{-1/2}$; show that this matrix is positive semidefinite by
diagonalizing $A$.]

8.37. (Sec. 8.8) Let $x_\alpha^{(\nu)}$, $\alpha = 1, \dots, N_\nu$, be observations from $N(\mu^{(\nu)}, \Sigma)$, $\nu = 1, \dots, q$.
What criterion may be used to test the hypothesis that

$$\mu^{(\nu)} = \sum_{h=1}^{m}\gamma_h c_{h\nu} + \mu,$$

where $c_{h\nu}$ are given numbers and $\gamma_h$, $\mu$ are unknown vectors? [Note: This
hypothesis (that the means lie on an $m$-dimensional hyperplane with ratios of
distances known) can be put in the form of the general linear hypothesis.]

8.38. (Sec. 8.2) Let $x_\alpha$ be an observation from $N(\beta z_\alpha, \Sigma)$, $\alpha = 1, \dots, N$. Suppose
there is a known fixed vector $\gamma$ such that $\beta\gamma = 0$. How do you estimate $\beta$?

8.39. (Sec. 8.8) What is the largest group of transformations on $y_\alpha^{(i)}$, $\alpha = 1, \dots, N_i$,
$i = 1, \dots, q$, that leaves (1) invariant? Prove the test (12) is invariant under this
group.



CHAPTER 9 


Testing Independence of 
Sets of Variates 


9.1. INTRODUCTION

In this section we divide a set of p variates with a joint normal distribution 
into q subsets and ask whether the q subsets are mutually independent; this 
is equivalent to testing the hypothesis that each variable in one subset is 
uncorrelated with each variable in the others. We find the likelihood ratio 
criterion for this hypothesis, the moments of the criterion under the null 
hypothesis, some particular distributions, and an asymptotic expansion of the 
distribution. 

The likelihood ratio criterion is invariant under linear transformations 
within sets; another such criterion is developed. Alternative test procedures
are step-down procedures, which are not invariant, but are flexible. In the 
case of two sets, independence of the two sets is equivalent to the regression 
of one on the other being 0; the criteria for Chapter 8 are available. Some 
optimal properties of the likelihood ratio test are treated. 


9.2. THE LIKELIHOOD RATIO CRITERION FOR TESTING 
INDEPENDENCE OF SETS OF VARIATES 

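Let the $p$-component vector $X$ be distributed according to $N(\mu, \Sigma)$. We
partition $X$ into $q$ subvectors with $p_1, p_2, \dots, p_q$ components, respectively:
that is,

$$X = \begin{pmatrix} X^{(1)} \\ X^{(2)} \\ \vdots \\ X^{(q)} \end{pmatrix}. \tag{1}$$

The vector of means $\mu$ and the covariance matrix $\Sigma$ are partitioned similarly,

$$\mu = \begin{pmatrix} \mu^{(1)} \\ \mu^{(2)} \\ \vdots \\ \mu^{(q)} \end{pmatrix}, \tag{2}$$

$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} & \cdots & \Sigma_{1q} \\ \Sigma_{21} & \Sigma_{22} & \cdots & \Sigma_{2q} \\ \vdots & \vdots & & \vdots \\ \Sigma_{q1} & \Sigma_{q2} & \cdots & \Sigma_{qq} \end{pmatrix}. \tag{3}$$

The null hypothesis we wish to test is that the subvectors $X^{(1)}, \dots, X^{(q)}$ are
mutually independently distributed, that is, that the density of $X$ factors into
the densities of $X^{(1)}, \dots, X^{(q)}$. It is

$$H:\ n(x \mid \mu, \Sigma) = \prod_{i=1}^{q}n(x^{(i)} \mid \mu^{(i)}, \Sigma_{ii}). \tag{4}$$

If $X^{(1)}, \dots, X^{(q)}$ are independent subvectors,

$$\mathscr{E}(X^{(i)} - \mu^{(i)})(X^{(j)} - \mu^{(j)})' = \Sigma_{ij} = 0, \qquad i \ne j. \tag{5}$$

(See Section 2.4.) Conversely, if (5) holds, then (4) is true. Thus the null
hypothesis is equivalently $H$: $\Sigma_{ij} = 0$, $i \ne j$. This can be stated alternatively as
the hypothesis that $\Sigma$ is of the form

$$\Sigma_0 = \begin{pmatrix} \Sigma_{11} & 0 & \cdots & 0 \\ 0 & \Sigma_{22} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \Sigma_{qq} \end{pmatrix}. \tag{6}$$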
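Given a sample $x_1, \dots, x_N$ of $N$ observations on $X$, the likelihood ratio
criterion is

$$\lambda = \frac{\max_{\mu, \Sigma_0}L(\mu, \Sigma_0)}{\max_{\mu, \Sigma}L(\mu, \Sigma)}, \tag{7}$$

where

$$L(\mu, \Sigma) = \prod_{\alpha=1}^{N}\frac{1}{(2\pi)^{\frac{1}{2}p}|\Sigma|^{\frac{1}{2}}}\exp\left[-\tfrac{1}{2}(x_\alpha - \mu)'\Sigma^{-1}(x_\alpha - \mu)\right] \tag{8}$$

and $L(\mu, \Sigma_0)$ is $L(\mu, \Sigma)$ with $\Sigma_{ij} = 0$, $i \ne j$, and where the maximum is taken
with respect to all vectors $\mu$ and positive definite $\Sigma$ and $\Sigma_0$ (i.e., $\Sigma_{ii}$). As
derived in Section 5.2, Equation (6),

$$\max_{\mu, \Sigma}L(\mu, \Sigma) = \frac{1}{(2\pi)^{\frac{1}{2}pN}|\hat{\Sigma}_\Omega|^{\frac{1}{2}N}}e^{-\frac{1}{2}pN}, \tag{9}$$

where

$$\hat{\Sigma}_\Omega = \frac{1}{N}A = \frac{1}{N}\sum_{\alpha=1}^{N}(x_\alpha - \bar{x})(x_\alpha - \bar{x})'. \tag{10}$$

Under the null hypothesis,

$$L(\mu, \Sigma_0) = \prod_{i=1}^{q}L_i(\mu^{(i)}, \Sigma_{ii}), \tag{11}$$

where

$$L_i(\mu^{(i)}, \Sigma_{ii}) = \prod_{\alpha=1}^{N}\frac{1}{(2\pi)^{\frac{1}{2}p_i}|\Sigma_{ii}|^{\frac{1}{2}}}\exp\left[-\tfrac{1}{2}(x_\alpha^{(i)} - \mu^{(i)})'\Sigma_{ii}^{-1}(x_\alpha^{(i)} - \mu^{(i)})\right]. \tag{12}$$

Clearly

$$\max_{\mu, \Sigma_0}L(\mu, \Sigma_0) = \prod_{i=1}^{q}\max_{\mu^{(i)}, \Sigma_{ii}}L_i(\mu^{(i)}, \Sigma_{ii}) = \prod_{i=1}^{q}\frac{1}{(2\pi)^{\frac{1}{2}p_iN}|\hat{\Sigma}_{ii\omega}|^{\frac{1}{2}N}}e^{-\frac{1}{2}p_iN} = \frac{1}{(2\pi)^{\frac{1}{2}pN}\prod_{i=1}^{q}|\hat{\Sigma}_{ii\omega}|^{\frac{1}{2}N}}e^{-\frac{1}{2}pN}, \tag{13}$$

where

$$\hat{\Sigma}_{ii\omega} = \frac{1}{N}\sum_{\alpha=1}^{N}(x_\alpha^{(i)} - \bar{x}^{(i)})(x_\alpha^{(i)} - \bar{x}^{(i)})'. \tag{14}$$

If we partition $A$ and $\hat{\Sigma}_\Omega$ as we have $\Sigma$,

$$A = \begin{pmatrix} A_{11} & A_{12} & \cdots & A_{1q} \\ A_{21} & A_{22} & \cdots & A_{2q} \\ \vdots & \vdots & & \vdots \\ A_{q1} & A_{q2} & \cdots & A_{qq} \end{pmatrix}, \qquad \hat{\Sigma}_\Omega = \begin{pmatrix} \hat{\Sigma}_{11} & \hat{\Sigma}_{12} & \cdots & \hat{\Sigma}_{1q} \\ \hat{\Sigma}_{21} & \hat{\Sigma}_{22} & \cdots & \hat{\Sigma}_{2q} \\ \vdots & \vdots & & \vdots \\ \hat{\Sigma}_{q1} & \hat{\Sigma}_{q2} & \cdots & \hat{\Sigma}_{qq} \end{pmatrix}, \tag{15}$$

we see that $\hat{\Sigma}_{ii\omega} = \hat{\Sigma}_{ii} = (1/N)A_{ii}$.

The likelihood ratio criterion is

$$\lambda = \frac{\max_{\mu, \Sigma_0}L(\mu, \Sigma_0)}{\max_{\mu, \Sigma}L(\mu, \Sigma)} = \frac{|\hat{\Sigma}_\Omega|^{\frac{1}{2}N}}{\prod_{i=1}^{q}|\hat{\Sigma}_{ii}|^{\frac{1}{2}N}} = \frac{|A|^{\frac{1}{2}N}}{\prod_{i=1}^{q}|A_{ii}|^{\frac{1}{2}N}}. \tag{16}$$

The critical region of the likelihood ratio test is

$$\lambda \le \lambda(\varepsilon), \tag{17}$$

where $\lambda(\varepsilon)$ is a number such that the probability of (17) is $\varepsilon$ with $\Sigma = \Sigma_0$. (It
remains to show that such a number can be found.) Let

$$V = \frac{|A|}{\prod_{i=1}^{q}|A_{ii}|}. \tag{18}$$

Then $\lambda = V^{\frac{1}{2}N}$ is a monotonic increasing function of $V$. The critical region
(17) can be equivalently written as

$$V \le V(\varepsilon). \tag{19}$$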

Theorem 9.2.1. Let $x_1, \dots, x_N$ be a sample of $N$ observations drawn from
$N(\mu, \Sigma)$, where $x_\alpha$, $\mu$, and $\Sigma$ are partitioned into $p_1, \dots, p_q$ rows (and columns
in the case of $\Sigma$) as indicated in (1), (2), and (3). The likelihood ratio criterion
that the $q$ sets of components are mutually independent is given by (16), where $A$
is defined by (10) and partitioned according to (15). The likelihood ratio test is
given by (17) and equivalently by (19), where $V$ is defined by (18) and $\lambda(\varepsilon)$ or
$V(\varepsilon)$ is chosen to obtain the significance level $\varepsilon$.


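Since $r_{ij} = a_{ij}/\sqrt{a_{ii}a_{jj}}$, we have

$$|A| = |R|\prod_{i=1}^{p}a_{ii}, \tag{20}$$

where

$$R = (r_{ij}) = \begin{pmatrix} R_{11} & R_{12} & \cdots & R_{1q} \\ R_{21} & R_{22} & \cdots & R_{2q} \\ \vdots & \vdots & & \vdots \\ R_{q1} & R_{q2} & \cdots & R_{qq} \end{pmatrix}, \tag{21}$$

and

$$|A_{ii}| = |R_{ii}|\prod_{j=p_1+\cdots+p_{i-1}+1}^{p_1+\cdots+p_i}a_{jj}. \tag{22}$$

Thus

$$V = \frac{|A|}{\prod_{i=1}^{q}|A_{ii}|} = \frac{|R|}{\prod_{i=1}^{q}|R_{ii}|}. \tag{23}$$

That is, $V$ can be expressed entirely in terms of sample correlation coefficients.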
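The equality of the two ratios in (23) can be seen on data. The sketch below assumes NumPy and, unlike Chapter 8, stores observations as the rows of an $N \times p$ array; `independence_criterion` and its argument names are hypothetical.

```python
import numpy as np

def independence_criterion(X, sizes):
    """Return V computed from A and from R as in (18) and (23).
    X is N x p (one observation per row); sizes = (p_1, ..., p_q)."""
    Xc = X - X.mean(axis=0)
    A = Xc.T @ Xc                           # sums of squares and cross products
    d = np.sqrt(np.diag(A))
    R = A / np.outer(d, d)                  # sample correlation matrix

    def ratio(M):
        den, start = 1.0, 0
        for s in sizes:                     # product of determinants of diagonal blocks
            den *= np.linalg.det(M[start:start + s, start:start + s])
            start += s
        return np.linalg.det(M) / den

    return ratio(A), ratio(R)
```

Both returned values should agree up to rounding, and each lies between 0 and 1.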

We can interpret the criterion $V$ in terms of generalized variance. Each
set $(x_{i1}, \dots, x_{iN})$ can be considered as a vector in $N$-space; the set $(x_{i1} - \bar{x}_i, \dots,
x_{iN} - \bar{x}_i) = z_i$, say, is the projection on the plane orthogonal to the
equiangular line. The determinant $|A|$ is the $p$-dimensional volume squared
of the parallelotope with $z_1, \dots, z_p$ as principal edges. The determinant $|A_{ii}|$
is the $p_i$-dimensional volume squared of the parallelotope having as principal
edges the $i$th set of vectors. If each set of vectors is orthogonal to each other
set (i.e., $R_{ij} = 0$, $i \ne j$), then the volume squared $|A|$ is the product of the
volumes squared $|A_{ii}|$. For example, if $p = 2$, $p_1 = p_2 = 1$, this statement is
that the area of a parallelogram is the product of the lengths of the sides
if the sides are at right angles. If the sets are almost orthogonal, then $|A|$
is almost $\prod|A_{ii}|$, and $V$ is almost 1.

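The criterion has an invariance property. Let $C_i$ be an arbitrary nonsingular
matrix of order $p_i$ and let

$$C = \begin{pmatrix} C_1 & 0 & \cdots & 0 \\ 0 & C_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & C_q \end{pmatrix}. \tag{24}$$

Let $Cx_\alpha + d = x_\alpha^*$. Then the criterion for independence in terms of $x_\alpha^*$ is
identical to the criterion in terms of $x_\alpha$. Let $A^* = \sum_\alpha(x_\alpha^* - \bar{x}^*)(x_\alpha^* - \bar{x}^*)'$ be
partitioned into submatrices $A_{ij}^*$. Then

$$A_{ij}^* = \sum_\alpha(x_\alpha^{*(i)} - \bar{x}^{*(i)})(x_\alpha^{*(j)} - \bar{x}^{*(j)})' = C_i\sum_\alpha(x_\alpha^{(i)} - \bar{x}^{(i)})(x_\alpha^{(j)} - \bar{x}^{(j)})'C_j' = C_iA_{ij}C_j' \tag{25}$$

and $A^* = CAC'$. Thus

$$V^* = \frac{|A^*|}{\prod|A_{ii}^*|} = \frac{|CAC'|}{\prod|C_iA_{ii}C_i'|} = \frac{|C| \cdot |A| \cdot |C'|}{\prod|C_i| \cdot |A_{ii}| \cdot |C_i'|} = \frac{|A|}{\prod|A_{ii}|} = V \tag{26}$$

for $|C| = \prod|C_i|$. Thus the test is invariant with respect to linear
transformations within each set.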
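The invariance in (26) can be checked by transforming $A$ with a block-diagonal $C$. This is a minimal sketch assuming NumPy; `v_from_A` and `block_diag` are hypothetical helpers written out here so the example has no dependencies beyond NumPy.

```python
import numpy as np

def v_from_A(A, sizes):
    """V = |A| / prod |A_ii| for the partition sizes = (p_1, ..., p_q)."""
    den, start = 1.0, 0
    for s in sizes:
        den *= np.linalg.det(A[start:start + s, start:start + s])
        start += s
    return np.linalg.det(A) / den

def block_diag(blocks):
    """Assemble the block-diagonal matrix C of (24) from C_1, ..., C_q."""
    p = sum(b.shape[0] for b in blocks)
    C = np.zeros((p, p))
    start = 0
    for b in blocks:
        s = b.shape[0]
        C[start:start + s, start:start + s] = b
        start += s
    return C
```

Evaluating `v_from_A(C @ A @ C.T, sizes)` for such a `C` should reproduce `v_from_A(A, sizes)`; a matrix `C` that mixes variables across sets would in general change the value.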

Narain (1950) showed that the test based on $V$ is strictly unbiased; that is,
the probability of rejecting the null hypothesis is greater than the significance
level if the hypothesis is not true. [See also Daly (1940).]


9.3. THE DISTRIBUTION OF THE LIKELIHOOD RATIO CRITERION
WHEN THE NULL HYPOTHESIS IS TRUE

9.3.1. Characterization of the Distribution

We shall show that under the null hypothesis the distribution of the criterion
$V$ is the distribution of a product of independent variables, each of which has
the distribution of a criterion $U$ for the linear hypothesis (Section 8.4).

Let

$$V_i = \frac{\left|\begin{matrix} A_{11} & \cdots & A_{1i} \\ \vdots & & \vdots \\ A_{i1} & \cdots & A_{ii} \end{matrix}\right|}{\left|\begin{matrix} A_{11} & \cdots & A_{1,i-1} \\ \vdots & & \vdots \\ A_{i-1,1} & \cdots & A_{i-1,i-1} \end{matrix}\right| \cdot |A_{ii}|}, \qquad i = 2, \dots, q. \tag{1}$$

Then $V = V_2V_3 \cdots V_q$. Note that $V_i$ is the $N/2$th root of the likelihood ratio
criterion for testing the null hypothesis

$$H_i:\ \Sigma_{i1} = 0, \dots, \Sigma_{i,i-1} = 0, \tag{2}$$

that is, that $X^{(i)}$ is independent of $(X^{(1)\prime}, \dots, X^{(i-1)\prime})'$. The null hypothesis
$H$ is the intersection of these hypotheses.
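The telescoping product $V = V_2V_3 \cdots V_q$ follows from (1) and can be verified directly. The sketch below assumes NumPy; `stepwise_factors` is a hypothetical name.

```python
import numpy as np

def stepwise_factors(A, sizes):
    """Return [V_2, ..., V_q] as defined in (1) for the partition
    sizes = (p_1, ..., p_q) of the p x p matrix A."""
    cuts = np.cumsum(sizes)
    factors = []
    for i in range(1, len(sizes)):
        lo, hi = cuts[i - 1], cuts[i]
        num = np.linalg.det(A[:hi, :hi])                  # leading blocks 1..i
        den = np.linalg.det(A[:lo, :lo]) * np.linalg.det(A[lo:hi, lo:hi])
        factors.append(num / den)
    return factors
```

Each factor compares the $i$th set with all preceding sets taken together, which is why the product reproduces the overall criterion (18).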

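Theorem 9.3.1. When $H_i$ is true, $V_i$ has the distribution of $U_{p_i, \bar{p}_i, n - \bar{p}_i}$, where
$n = N - 1$ and $\bar{p}_i = p_1 + \cdots + p_{i-1}$, $i = 2, \dots, q$.

Proof. The matrix $A$ has the distribution of $\sum_{\alpha=1}^{n}Z_\alpha Z_\alpha'$, where $Z_1, \dots, Z_n$
are independently distributed according to $N(0, \Sigma)$ and $Z_\alpha$ is partitioned as
$(Z_\alpha^{(1)\prime}, \dots, Z_\alpha^{(q)\prime})'$. Then conditional on $Z_\alpha^{(1)} = z_\alpha^{(1)}, \dots, Z_\alpha^{(i-1)} = z_\alpha^{(i-1)}$, $\alpha =
1, \dots, n$, the subvectors $Z_1^{(i)}, \dots, Z_n^{(i)}$ are independently distributed, $Z_\alpha^{(i)}$
having a normal distribution with mean

$$\beta_i\begin{pmatrix} z_\alpha^{(1)} \\ \vdots \\ z_\alpha^{(i-1)} \end{pmatrix} \tag{3}$$

and covariance matrix

$$\Sigma_{ii} - \beta_i\,(\Sigma_{i1}, \dots, \Sigma_{i,i-1})', \tag{4}$$

where

$$\beta_i = (\Sigma_{i1}, \dots, \Sigma_{i,i-1})\begin{pmatrix} \Sigma_{11} & \cdots & \Sigma_{1,i-1} \\ \vdots & & \vdots \\ \Sigma_{i-1,1} & \cdots & \Sigma_{i-1,i-1} \end{pmatrix}^{-1}. \tag{5}$$

When the null hypothesis is not assumed, the estimator of $\beta_i$ is (5) with $\Sigma_{jk}$
replaced by $A_{jk}$, and the estimator of (4) is (4) with $\Sigma_{jk}$ replaced by $(1/n)A_{jk}$
and $\beta_i$ replaced by its estimator. Under $H_i$: $\beta_i = 0$ and the covariance matrix
(4) is $\Sigma_{ii}$, which is estimated by $(1/n)A_{ii}$. The $N/2$th root of the likelihood
ratio criterion for $H_i$ is

$$\frac{\left|A_{ii} - (A_{i1}, \dots, A_{i,i-1})\begin{pmatrix} A_{11} & \cdots & A_{1,i-1} \\ \vdots & & \vdots \\ A_{i-1,1} & \cdots & A_{i-1,i-1} \end{pmatrix}^{-1}\begin{pmatrix} A_{1i} \\ \vdots \\ A_{i-1,i} \end{pmatrix}\right|}{|A_{ii}|} = \frac{\left|\begin{matrix} A_{11} & \cdots & A_{1i} \\ \vdots & & \vdots \\ A_{i1} & \cdots & A_{ii} \end{matrix}\right|}{\left|\begin{matrix} A_{11} & \cdots & A_{1,i-1} \\ \vdots & & \vdots \\ A_{i-1,1} & \cdots & A_{i-1,i-1} \end{matrix}\right| \cdot |A_{ii}|}, \tag{6}$$

which is $V_i$. This is the $U$-statistic for $p_i$ dimensions, $\bar{p}_i$ components of the
conditioning vector, and $n - \bar{p}_i$ degrees of freedom in the estimator of the
covariance matrix. ■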


Theorem 9.3.2. The distribution of $V$ under the null hypothesis is the
distribution of $V_2V_3 \cdots V_q$, where $V_2, \dots, V_q$ are independently distributed with $V_i$
having the distribution of $U_{p_i, \bar{p}_i, n - \bar{p}_i}$, where $\bar{p}_i = p_1 + \cdots + p_{i-1}$.

Proof. From the proof of Theorem 9.3.1, we see that the distribution of $V_i$
is that of $U_{p_i, \bar{p}_i, n - \bar{p}_i}$ not depending on the conditioning $z_\alpha^{(k)}$, $k = 1, \dots, i - 1$,
$\alpha = 1, \dots, n$. Hence the distribution of $V_i$ does not depend on $V_2, \dots, V_{i-1}$. ■

Theorem 9.3.3. Under the null hypothesis $V$ is distributed as $\prod_{i=2}^{q}\prod_{j=1}^{p_i}X_{ij}$,
where the $X_{ij}$'s are independent and $X_{ij}$ has the density
$\beta[x \mid \tfrac{1}{2}(n - \bar{p}_i + 1 - j), \tfrac{1}{2}\bar{p}_i]$.

Proof. This theorem follows from Theorems 9.3.2 and 8.4.1. ■


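9.3.2. Moments

Theorem 9.3.4. When the null hypothesis is true, the $h$th moment of the
criterion is

$$\mathscr{E}V^h = \prod_{i=2}^{q}\prod_{j=1}^{p_i}\frac{\Gamma[\tfrac{1}{2}(n - \bar{p}_i + 1 - j) + h]\,\Gamma[\tfrac{1}{2}(n + 1 - j)]}{\Gamma[\tfrac{1}{2}(n - \bar{p}_i + 1 - j)]\,\Gamma[\tfrac{1}{2}(n + 1 - j) + h]}. \tag{7}$$

Proof. Because $V_2, \dots, V_q$ are independent,

$$\mathscr{E}V^h = \prod_{i=2}^{q}\mathscr{E}V_i^h. \tag{8}$$

Theorem 9.3.2 implies $\mathscr{E}V_i^h = \mathscr{E}U_{p_i, \bar{p}_i, n - \bar{p}_i}^h$. Then the theorem follows by
substituting from Theorem 8.4.3. ■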
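Formula (7) is straightforward to evaluate with log-gamma functions. The following is a minimal sketch using only the Python standard library; `moment_V` is a hypothetical name, and `sizes` stands for $(p_1, \dots, p_q)$.

```python
from math import exp, lgamma

def moment_V(h, n, sizes):
    """h-th moment of V under the null hypothesis, from (7);
    n = N - 1 and sizes = (p_1, ..., p_q)."""
    log_m = 0.0
    pbar = sizes[0]                         # p-bar_i = p_1 + ... + p_{i-1}
    for p_i in sizes[1:]:
        for j in range(1, p_i + 1):
            a = 0.5 * (n - pbar + 1 - j)
            b = 0.5 * (n + 1 - j)
            log_m += lgamma(a + h) + lgamma(b) - lgamma(a) - lgamma(b + h)
        pbar += p_i
    return exp(log_m)
```

As a sanity check, for $q = 2$ and $p_1 = p_2 = 1$ the criterion is $V = 1 - r^2$, and the first moment from (7) reduces to $(n - 1)/n$.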

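If the $p_i$ are even, say $p_i = 2r_i$, $i > 1$, then by using the duplication formula
$\Gamma(\alpha + \tfrac{1}{2})\Gamma(\alpha + 1) = \sqrt{\pi}\,\Gamma(2\alpha + 1)2^{-2\alpha}$ for the gamma function we can
reduce the $h$th moment of $V$ to

$$\mathscr{E}V^h = \prod_{i=2}^{q}\prod_{k=1}^{r_i}\frac{\Gamma(n + 1 - \bar{p}_i - 2k + 2h)\,\Gamma(n + 1 - 2k)}{\Gamma(n + 1 - \bar{p}_i - 2k)\,\Gamma(n + 1 - 2k + 2h)} = \prod_{i=2}^{q}\prod_{k=1}^{r_i}\left\{B^{-1}(n + 1 - \bar{p}_i - 2k,\ \bar{p}_i)\int_0^1 x^{n+1-\bar{p}_i-2k+2h-1}(1 - x)^{\bar{p}_i-1}\,dx\right\}.$$

Thus $V$ is distributed as $\prod_{i=2}^{q}\prod_{k=1}^{r_i}Y_{ik}^2$, where the $Y_{ik}$ are independent, and
$Y_{ik}$ has density $\beta(y;\ n + 1 - \bar{p}_i - 2k,\ \bar{p}_i)$.

In general, the duplication formula for the gamma function can be used to
reduce the moments as indicated in Section 8.4.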


9.3.3. Some Special Distributions

If $q = 2$, then $V$ is distributed as $U_{p_2, p_1, n - p_1}$. Special cases have been treated
in Section 8.4, and references to the literature given. The distribution for
$p_1 = p_2 = p_3 = 1$ is given in Problem 9.2, and for $p_1 = p_2 = p_3 = 2$ in Problem
9.3. Wilks (1935) gave the distributions for $p_1 = p_2 = 1$, for $p_3 = p - 2$,† for
$p_1 = 1$, $p_2 = p_3 = 2$, for $p_1 = 1$, $p_2 = 2$, $p_3 = 3$, for $p_1 = 1$, $p_2 = 2$, $p_3 = 4$, and
for $p_1 = p_2 = 2$, $p_3 = 3$. Consul (1967a) treated the case $p_1 = 2$, $p_2 = 3$, $p_3$
even.

Wald and Brookner (1941) gave a method for deriving the distribution if
not more than one $p_i$ is odd. It can be seen that the same result can be
obtained by integration of products of beta functions after using the
duplication formula to reduce the moments.

Mathai and Saxena (1973) gave the exact distribution for the general case.
Mathai and Katiyar (1979) gave exact significance points for $p = 3(1)10$ and
$n = 3(1)20$ for significance levels of 5% and 1% (of $-k\log V$ of Section 9.4).

† In Wilks's formula $\Gamma[\tfrac{1}{2}(N - 2 - i)]$ should be $\Gamma[\tfrac{1}{2}(n - 2 - i)]$.
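9.4. AN ASYMPTOTIC EXPANSION OF THE DISTRIBUTION OF THE
LIKELIHOOD RATIO CRITERION

The hth moment of $\lambda = V^{N/2}$ is

(1) $$\mathcal{E}\lambda^h = K\,\frac{\prod_{i=1}^{p}\Gamma\{\tfrac12[N(1+h)-i]\}}{\prod_{i=1}^{q}\left\{\prod_{j=1}^{p_i}\Gamma\{\tfrac12[N(1+h)-j]\}\right\}},$$

where K is chosen so that $\mathcal{E}\lambda^0 = 1$. This is of the form of (1) of Section 8.5
with

(2) $$\begin{aligned} a &= p, \quad b = p, \quad x_k = \tfrac12 N, \quad \xi_k = -\tfrac12 k, \quad k = 1,\ldots,p,\\ y_j &= \tfrac12 N, \quad \eta_j = \tfrac12(-j + p_1 + \cdots + p_{i-1}),\\ j &= p_1 + \cdots + p_{i-1} + 1, \ldots, p_1 + \cdots + p_i, \quad i = 1,\ldots,q.\end{aligned}$$

Then $f = \tfrac12[p(p+1) - \sum p_i(p_i+1)] = \tfrac12(p^2 - \sum p_i^2)$, $\beta_k = \epsilon_j = \tfrac12(1-\rho)N$.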

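In order to make the second term in the expansion vanish we take $\rho$ as

(3) $$\rho = 1 - \frac{2(p^3 - \sum p_i^3) + 9(p^2 - \sum p_i^2)}{6N(p^2 - \sum p_i^2)}.$$

Let

(4) $$k = \rho N = N - \frac{3}{2} - \frac{p^3 - \sum p_i^3}{3(p^2 - \sum p_i^2)}.$$

Then $\omega_2 = \gamma_2/k^2$, where [as shown by Box (1949)]

(5) $$\gamma_2 = \frac{p^4 - \sum p_i^4}{48} - \frac{5(p^2 - \sum p_i^2)}{96} - \frac{(p^3 - \sum p_i^3)^2}{72(p^2 - \sum p_i^2)}.$$

We obtain from Section 8.5 the following expansion:

(6) $$\Pr\{-k\log V \le v\} = \Pr\{\chi_f^2 \le v\} + \frac{\gamma_2}{k^2}\left[\Pr\{\chi_{f+4}^2 \le v\} - \Pr\{\chi_f^2 \le v\}\right] + O(k^{-3}).$$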
Table 9.1

 p    f     v        gamma_2   N    k      gamma_2/k^2   Second Term
 4    6   12.592    11/24     15   71/6     0.0033        -0.0007
 5   10   18.307    15/8      15   69/6     0.0142        -0.0021
 6   15   24.996    235/48    15   67/6     0.0393        -0.0043
 6   15   24.996    235/48    16   73/6     0.0331        -0.0036


Table 9.1 gives an indication of the order of approximation of (6) for
$p_i = 1$. In each case v is chosen so that the first term is 0.95.
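The constants f, k, and $\gamma_2$ in Table 9.1 follow directly from (4) and (5). The sketch below is an illustration only (the function name `box_constants` is not from the text); it uses exact fractions so that values such as 71/6 and 11/24 appear without rounding.

```python
from fractions import Fraction

def box_constants(block_sizes, N):
    """Return (f, k, gamma_2) of Section 9.4 as exact fractions.

    block_sizes is the list [p_1, ..., p_q]; at least two blocks are needed.
    """
    p = sum(block_sizes)
    d2 = p**2 - sum(s**2 for s in block_sizes)   # p^2 - sum p_i^2
    d3 = p**3 - sum(s**3 for s in block_sizes)   # p^3 - sum p_i^3
    d4 = p**4 - sum(s**4 for s in block_sizes)   # p^4 - sum p_i^4
    f = Fraction(d2, 2)
    k = N - Fraction(3, 2) - Fraction(d3, 3 * d2)
    gamma2 = Fraction(d4, 48) - Fraction(5 * d2, 96) - Fraction(d3**2, 72 * d2)
    return f, k, gamma2

# Rows of Table 9.1 (every p_i = 1)
for p, N in [(4, 15), (5, 15), (6, 15), (6, 16)]:
    f, k, g2 = box_constants([1] * p, N)
    print(p, f, N, k, g2, round(float(g2 / k**2), 4))
```

The second term of (6) additionally needs chi-square probabilities, which are left out here to keep the sketch free of external libraries.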

If $q = 2$, the approximate distributions given in Sections 8.5.3 and 8.5.4 are
available. [See also Nagao (1973c).]


9.5. OTHER CRITERIA

In case $q = 2$, the criteria considered in Section 8.6 can be used with $G + H$
replaced by $A_{11}$ and $H$ replaced by $A_{12}A_{22}^{-1}A_{21}$, or $G + H$ replaced by $A_{22}$
and $H$ replaced by $A_{21}A_{11}^{-1}A_{12}$.
The null hypothesis of independence is that $\Sigma - \Sigma_0 = 0$, where $\Sigma_0$ is
defined in (6) of Section 9.2. An appropriate test procedure will reject the
null hypothesis if the elements of $A - A_0$ are large compared to the elements
of the diagonal blocks of $A_0$ (where $A_0$ is composed of diagonal blocks $A_{ii}$
and off-diagonal blocks of 0). Let the nonsingular matrix $B_{ii}$ be such that
$B_{ii}A_{ii}B_{ii}' = I$, that is, $A_{ii}^{-1} = B_{ii}'B_{ii}$, and let $B_0$ be the matrix with $B_{ii}$ as the
ith diagonal block and 0's as off-diagonal blocks. Then $B_0A_0B_0' = I$ and

(1) $$B_0(A - A_0)B_0' = \begin{pmatrix} 0 & B_{11}A_{12}B_{22}' & \cdots & B_{11}A_{1q}B_{qq}'\\ B_{22}A_{21}B_{11}' & 0 & \cdots & B_{22}A_{2q}B_{qq}'\\ \vdots & \vdots & & \vdots\\ B_{qq}A_{q1}B_{11}' & B_{qq}A_{q2}B_{22}' & \cdots & 0 \end{pmatrix}.$$

This matrix is invariant with respect to transformations (24) of Section 9.2
operating on A. A different choice of $B_{ii}$ amounts to multiplying (1) on the
left by $Q_0$ and on the right by $Q_0'$, where $Q_0$ is a matrix with orthogonal
diagonal blocks and off-diagonal blocks of 0's. A test procedure should reject
the null hypothesis if some measure of the numerical values of the elements
of (1) is too large. The likelihood ratio criterion is the N/2 power of
$|B_0(A - A_0)B_0' + I| = |B_0AB_0'|$.

Another measure, suggested by Nagao (1973a), is

(2) $$\tfrac12\,\mathrm{tr}[B_0(A - A_0)B_0']^2 = \tfrac12\,\mathrm{tr}[(A - A_0)A_0^{-1}]^2 = \tfrac12\,\mathrm{tr}(AA_0^{-1} - I)^2 = \tfrac12\sum_{\substack{i,j=1\\ i\ne j}}^{q}\mathrm{tr}\,A_{ij}A_{jj}^{-1}A_{ji}A_{ii}^{-1}.$$

For $q = 2$ this measure is the average of the Bartlett-Nanda-Pillai trace
criterion with $G + H$ replaced by $A_{11}$ and $H$ replaced by $A_{12}A_{22}^{-1}A_{21}$ and the
same criterion with $G + H$ replaced by $A_{22}$ and $H$ replaced by $A_{21}A_{11}^{-1}A_{12}$.

This criterion multiplied by n or N has a limiting $\chi^2$-distribution with
number of degrees of freedom $f = \tfrac12(p^2 - \sum_{i=1}^{q}p_i^2)$, which is the same
number as for $-N\log V$. Nagao obtained an asymptotic expansion of the
distribution:


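(3) $$\begin{aligned}\Pr\{\tfrac12 n\,\mathrm{tr}(AA_0^{-1} - I)^2 \le x\} &= \Pr\{\chi_f^2 \le x\}\\ &\quad + \frac{1}{n}\Big[\tfrac{1}{12}\Big(p^3 - 3p\sum_{i=1}^{q}p_i^2 + 2\sum_{i=1}^{q}p_i^3\Big)\Pr\{\chi_{f+6}^2 \le x\}\\ &\qquad + \tfrac18\Big(-2p^3 + 4p\sum_{i=1}^{q}p_i^2 - 2\sum_{i=1}^{q}p_i^3 - p^2 + \sum_{i=1}^{q}p_i^2\Big)\Pr\{\chi_{f+4}^2 \le x\}\\ &\qquad + \tfrac14\Big(p^3 - p\sum_{i=1}^{q}p_i^2 + p^2 - \sum_{i=1}^{q}p_i^2\Big)\Pr\{\chi_{f+2}^2 \le x\}\\ &\qquad - \tfrac{1}{24}\Big(2p^3 - 2\sum_{i=1}^{q}p_i^3 + 3p^2 - 3\sum_{i=1}^{q}p_i^2\Big)\Pr\{\chi_f^2 \le x\}\Big]\\ &\quad + O(n^{-2}).\end{aligned}$$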
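When every block has a single component ($p_i = 1$), each term $\mathrm{tr}\,A_{ij}A_{jj}^{-1}A_{ji}A_{ii}^{-1}$ in (2) is the squared sample correlation $r_{ij}^2$, so Nagao's criterion multiplied by n is n times the sum of squared correlations above the diagonal. The following sketch covers only that special case; the function name and the small matrix are hypothetical.

```python
def nagao_statistic_scalar_blocks(R, n):
    """n times criterion (2) when every p_i = 1: n * sum over i<j of r_ij^2.

    R is a correlation matrix given as a list of rows.
    """
    p = len(R)
    return n * sum(R[i][j] ** 2 for i in range(p) for j in range(i + 1, p))

# Hypothetical 3 x 3 correlation matrix, for illustration only
R_demo = [[1.0, 0.5, 0.2],
          [0.5, 1.0, 0.1],
          [0.2, 0.1, 1.0]]
statistic_demo = nagao_statistic_scalar_blocks(R_demo, 10)
```

The result would be referred to the $\chi^2$-distribution with $f = \tfrac12 p(p-1)$ degrees of freedom, or to the expansion (3) for a closer approximation.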


9.6. STEP-DOWN PROCEDURES 
9.6.1. Step-down by Blocks

It was shown in Section 9.3 that the N/2th root of the likelihood ratio
criterion, namely V, is the product of $q - 1$ of these criteria, that is,
$V_2, \ldots, V_q$. The ith subcriterion $V_i$ provides a likelihood ratio test of the
hypothesis $H_i$ [(2) of Section 9.3] that the ith subvector is independent of the
preceding $i - 1$ subvectors. Under the null hypothesis $H$ [$= \cap_{i=2}^{q} H_i$], these
$q - 1$ criteria are independent (Theorem 9.3.2). A step-down testing
procedure is to accept the null hypothesis if

(1) $$V_i \ge v_i(\epsilon_i), \qquad i = 2, \ldots, q,$$

and reject the null hypothesis if $V_i < v_i(\epsilon_i)$ for any i. Here $v_i(\epsilon_i)$ is the
number such that the probability of (1) when $H_i$ is true is $1 - \epsilon_i$. The
significance level of the procedure is $\epsilon$ satisfying

(2) $$1 - \epsilon = \prod_{i=2}^{q}(1 - \epsilon_i).$$

The subtests can be done sequentially, say, in the order $2, \ldots, q$. As soon as a
subtest calls for rejection, the procedure is terminated; if no subtest leads to
rejection, H is accepted. The ordering of the subvectors is at the discretion
of the investigator as well as the ordering of the tests.
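Relation (2) between the component levels $\epsilon_i$ and the overall level $\epsilon$ is easy to evaluate in either direction. This is a minimal sketch with illustrative function names; the equal-split rule in the second function is one possible convention, not something prescribed by the text.

```python
import math

def overall_level(component_levels):
    """Overall significance level eps from 1 - eps = prod(1 - eps_i)."""
    return 1.0 - math.prod(1.0 - e for e in component_levels)

def equal_component_level(eps, num_tests):
    """Common eps_i that yields overall level eps across num_tests subtests."""
    return 1.0 - (1.0 - eps) ** (1.0 / num_tests)
```

For example, with $q = 4$ there are three subtests, and `equal_component_level(0.05, 3)` gives the common level each subtest would use for an overall 5% procedure.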

Suppose, for example, that measurements on an individual are grouped
into physiological measurements, measurements of intelligence, and
measurements of emotional characteristics. One could test that intelligence is
independent of physiology and then that emotions are independent of
physiology and intelligence, or the order of these could be reversed.
Alternatively, one could test that intelligence is independent of emotions and then
that physiology is independent of these two aspects, or the order reversed.
There is a third pair of procedures.
Other criteria for the linear hypothesis discussed in Section 8.6 can be
used to test the component hypotheses $H_2, \ldots, H_q$ in a similar fashion.
When $H_i$ is true, the criterion is distributed independently of $X_\alpha^{(1)}, \ldots, X_\alpha^{(i-1)}$,
$\alpha = 1, \ldots, N$, and hence independently of the criteria for $H_2, \ldots, H_{i-1}$.


9.6.2. Step-down by Components 


In Section 8.4.5 we discussed a componentwise step-down procedure for 
testing that a submatrix of regression coefficients was a specified matrix. We 
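adapt this procedure to test the null hypothesis $H_i$ cast in the form

(3) $$H_i:\ (\Sigma_{i1}\ \Sigma_{i2}\ \cdots\ \Sigma_{i,i-1})\begin{pmatrix}\Sigma_{11} & \cdots & \Sigma_{1,i-1}\\ \vdots & & \vdots\\ \Sigma_{i-1,1} & \cdots & \Sigma_{i-1,i-1}\end{pmatrix}^{-1} = 0,$$

where 0 is of order $p_i \times \bar p_i$. The matrix in (3) consists of the coefficients of the
regression of $X^{(i)}$ on $(X^{(1)\prime}, \ldots, X^{(i-1)\prime})'$.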

For $i = 2$, we test in sequence whether the regression of $X_{p_1+1}$ on $X^{(1)} =
(X_1, \ldots, X_{p_1})'$ is 0, whether the regression of $X_{p_1+2}$ on $X^{(1)}$ is 0 in the
regression of $X_{p_1+2}$ on $X^{(1)}$ and $X_{p_1+1}$, ..., and whether the regression of
$X_{p_1+p_2}$ on $X^{(1)}$ is 0 in the regression of $X_{p_1+p_2}$ on $X^{(1)}, X_{p_1+1}, \ldots, X_{p_1+p_2-1}$.
These hypotheses are equivalently that the first, second, ..., and $p_2$th rows
of the matrix in (3) for $i = 2$ are 0-vectors.

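Let $A_{ii}^{(k)}$ be the $k \times k$ matrix in the upper left-hand corner of $A_{ii}$, let $A_{ij}^{(k)}$
consist of the upper k rows of $A_{ij}$, and let $A_{ji}^{(k)}$ consist of the first k columns
of $A_{ji}$, $k = 1, \ldots, p_i$. Then the criterion for testing that the first row of (3) is 0
is

(4) $$X_{i1} = \frac{A_{ii}^{(1)} - (A_{i1}^{(1)} \cdots A_{i,i-1}^{(1)})\begin{pmatrix}A_{11} & \cdots & A_{1,i-1}\\ \vdots & & \vdots\\ A_{i-1,1} & \cdots & A_{i-1,i-1}\end{pmatrix}^{-1}\begin{pmatrix}A_{1i}^{(1)}\\ \vdots\\ A_{i-1,i}^{(1)}\end{pmatrix}}{A_{ii}^{(1)}} = \frac{\begin{vmatrix}A_{11} & \cdots & A_{1,i-1} & A_{1i}^{(1)}\\ \vdots & & \vdots & \vdots\\ A_{i-1,1} & \cdots & A_{i-1,i-1} & A_{i-1,i}^{(1)}\\ A_{i1}^{(1)} & \cdots & A_{i,i-1}^{(1)} & A_{ii}^{(1)}\end{vmatrix}}{\begin{vmatrix}A_{11} & \cdots & A_{1,i-1}\\ \vdots & & \vdots\\ A_{i-1,1} & \cdots & A_{i-1,i-1}\end{vmatrix}\cdot A_{ii}^{(1)}}.$$

For $k > 1$, the criterion for testing that the kth row of the matrix in (3) is 0 is
[see (8) in Section 8.4]

(5) $$\begin{aligned}X_{ik} &= \frac{\left|A_{ii}^{(k)} - (A_{i1}^{(k)} \cdots A_{i,i-1}^{(k)})\begin{pmatrix}A_{11} & \cdots & A_{1,i-1}\\ \vdots & & \vdots\\ A_{i-1,1} & \cdots & A_{i-1,i-1}\end{pmatrix}^{-1}\begin{pmatrix}A_{1i}^{(k)}\\ \vdots\\ A_{i-1,i}^{(k)}\end{pmatrix}\right|}{\left|A_{ii}^{(k-1)} - (A_{i1}^{(k-1)} \cdots A_{i,i-1}^{(k-1)})\begin{pmatrix}A_{11} & \cdots & A_{1,i-1}\\ \vdots & & \vdots\\ A_{i-1,1} & \cdots & A_{i-1,i-1}\end{pmatrix}^{-1}\begin{pmatrix}A_{1i}^{(k-1)}\\ \vdots\\ A_{i-1,i}^{(k-1)}\end{pmatrix}\right|}\cdot\frac{|A_{ii}^{(k-1)}|}{|A_{ii}^{(k)}|}\\[4pt] &= \frac{\begin{vmatrix}A_{11} & \cdots & A_{1,i-1} & A_{1i}^{(k)}\\ \vdots & & \vdots & \vdots\\ A_{i-1,1} & \cdots & A_{i-1,i-1} & A_{i-1,i}^{(k)}\\ A_{i1}^{(k)} & \cdots & A_{i,i-1}^{(k)} & A_{ii}^{(k)}\end{vmatrix}}{\begin{vmatrix}A_{11} & \cdots & A_{1,i-1} & A_{1i}^{(k-1)}\\ \vdots & & \vdots & \vdots\\ A_{i-1,1} & \cdots & A_{i-1,i-1} & A_{i-1,i}^{(k-1)}\\ A_{i1}^{(k-1)} & \cdots & A_{i,i-1}^{(k-1)} & A_{ii}^{(k-1)}\end{vmatrix}}\cdot\frac{|A_{ii}^{(k-1)}|}{|A_{ii}^{(k)}|},\qquad k = 2, \ldots, p_i,\ i = 2, \ldots, q.\end{aligned}$$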


Under the null hypothesis the criterion $X_{ij}$ has the beta density
$\beta[x;\ \tfrac12(n - \bar p_i + 1 - j),\ \tfrac12\bar p_i]$. For given i, the criteria $X_{i1}, \ldots, X_{ip_i}$ are
independent (Theorem 8.4.1). The sets for different i are independent by the
argument in Section 9.6.1.

A step-down procedure consists of a sequence of tests based on
$X_{21}, \ldots, X_{2p_2}, X_{31}, \ldots, X_{qp_q}$. A particular component test leads to rejection if

(6) $$\frac{1 - X_{ij}}{X_{ij}}\cdot\frac{n - \bar p_i + 1 - j}{\bar p_i} > F_{\bar p_i,\,n-\bar p_i+1-j}(\epsilon_{ij}).$$

The significance level is $\epsilon$, where

(7) $$1 - \epsilon = \prod_{i=2}^{q}\prod_{j=1}^{p_i}(1 - \epsilon_{ij}).$$

The sequence of subvectors and the sequence of components within each
subvector is at the discretion of the investigator.

The criterion $V_i$ for testing $H_i$ is $V_i = \prod_{k=1}^{p_i}X_{ik}$, and the criterion for the null
hypothesis H is

(8) $$V = \prod_{i=2}^{q}V_i = \prod_{i=2}^{q}\prod_{k=1}^{p_i}X_{ik}.$$

These are the random variables described in Theorem 9.3.3.
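The left-hand side of (6) converts a component criterion $X_{ij}$ into a quantity that is compared with an F significance point. The sketch below only performs that conversion and reports the two degrees of freedom; the function name is illustrative, and looking up the F point itself is left to tables or a statistics library.

```python
def component_f_statistic(x_ij, n, p_bar_i, j):
    """Left-hand side of (6) together with its F degrees of freedom.

    x_ij    : component criterion X_ij, a value in (0, 1)
    n       : N - 1
    p_bar_i : p_1 + ... + p_{i-1}
    j       : index of the component within the ith subvector (1-based)
    """
    df1 = p_bar_i
    df2 = n - p_bar_i + 1 - j
    f_stat = (1.0 - x_ij) / x_ij * df2 / df1
    return f_stat, (df1, df2)
```

A small value of $X_{ij}$ gives a large F statistic, so rejection for large values in (6) matches rejection for small values of the criterion.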


9.7. AN EXAMPLE 


We take the following example from an industrial time study [Abruzzi
(1950)]. The purpose of the study was to investigate the length of time taken
by various operators in a garment factory to do several elements of a pressing
operation. The entire pressing operation was divided into the following six
elements:

1. Pick up and position garment.

2. Press and repress short dart.

3. Reposition garment on ironing board.

4. Press three-quarters of length of long dart.

5. Press balance of long dart.

6. Hang garment on rack.


In this case $x_\alpha$ is the vector of measurements on individual $\alpha$. The
component $x_{i\alpha}$ is the time taken to do the ith element of the operation. N is 76.
The data (in seconds) are summarized in the sample mean vector and
covariance matrix:

(1) $$\bar x = \begin{pmatrix} 9.47\\ 25.56\\ 13.25\\ 31.44\\ 27.29\\ 8.80 \end{pmatrix},$$

(2) $$S = \begin{pmatrix} 2.57 & 0.85 & 1.56 & 1.79 & 1.33 & 0.42\\ 0.85 & 37.00 & 3.34 & 13.47 & 7.59 & 0.52\\ 1.56 & 3.34 & 8.44 & 5.77 & 2.00 & 0.50\\ 1.79 & 13.47 & 5.77 & 34.01 & 10.50 & 1.77\\ 1.33 & 7.59 & 2.00 & 10.50 & 23.01 & 3.43\\ 0.42 & 0.52 & 0.50 & 1.77 & 3.43 & 4.59 \end{pmatrix}.$$

The sample standard deviations are (1.604, 6.041, 2.903, 5.832, 4.798, 2.141).
The sample correlation matrix is

(3) $$R = \begin{pmatrix} 1.000 & 0.088 & 0.334 & 0.191 & 0.173 & 0.123\\ 0.088 & 1.000 & 0.186 & 0.384 & 0.262 & 0.040\\ 0.334 & 0.186 & 1.000 & 0.343 & 0.144 & 0.080\\ 0.191 & 0.384 & 0.343 & 1.000 & 0.375 & 0.142\\ 0.173 & 0.262 & 0.144 & 0.375 & 1.000 & 0.334\\ 0.123 & 0.040 & 0.080 & 0.142 & 0.334 & 1.000 \end{pmatrix}.$$

The investigators are interested in testing the hypothesis that the six
variates are mutually independent. It often happens in time studies that a
new operation is proposed in which the elements are combined in a different
way; the new operation may use some of the elements several times and some
elements may be omitted. If the times for the different elements in the
operation for which data are available are independent, it may reasonably be
assumed that they will be independent in a new operation. Then the
distribution of time for the new operation can be estimated by using the
means and variances of the individual items.

In this problem the criterion V is $V = |R| = 0.472$. Since the sample size is
large we can use asymptotic theory: $k = 433/6$, $f = 15$, and $-k\log V = 54.1$.
Since the significance point for the $\chi^2$-distribution with 15 degrees of
freedom is 30.6 at the 0.01 significance level, we find the result significant.
We reject the hypothesis of independence; we cannot consider the times of
the elements independent.
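The arithmetic of this example can be retraced in a few lines. The sketch below takes $V = 0.472$ and the significance point 30.6 as reported in the text, and computes k from (4) of Section 9.4; because V is entered to three decimals, the statistic agrees with the reported 54.1 only up to rounding.

```python
import math
from fractions import Fraction

N, p = 76, 6                  # sample size and number of variates (every p_i = 1)
V = 0.472                     # |R| as reported in the text
d2 = p**2 - p                 # p^2 - sum p_i^2 when every p_i = 1
d3 = p**3 - p                 # p^3 - sum p_i^3 when every p_i = 1
f = d2 // 2                   # degrees of freedom
k = N - Fraction(3, 2) - Fraction(d3, 3 * d2)
statistic = -float(k) * math.log(V)
critical_value = 30.6         # 1% point of chi-square with 15 d.f., from the text
reject = statistic > critical_value
print(f, k, round(statistic, 2), reject)
```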


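9.8. THE CASE OF TWO SETS OF VARIATES

In the case of two sets of variates ($q = 2$), the random vector X, the
observation vector $x_\alpha$, the mean vector $\mu$, and the covariance matrix $\Sigma$ are
partitioned as follows:

(1) $$X = \begin{pmatrix}X^{(1)}\\ X^{(2)}\end{pmatrix},\quad x_\alpha = \begin{pmatrix}x_\alpha^{(1)}\\ x_\alpha^{(2)}\end{pmatrix},\quad \mu = \begin{pmatrix}\mu^{(1)}\\ \mu^{(2)}\end{pmatrix},\quad \Sigma = \begin{pmatrix}\Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{pmatrix}.$$

The null hypothesis of independence specifies that $\Sigma_{12} = 0$, that is, that $\Sigma$ is
of the form

(2) $$\Sigma_0 = \begin{pmatrix}\Sigma_{11} & 0\\ 0 & \Sigma_{22}\end{pmatrix}.$$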
The test criterion is

(3) $$V = \frac{|A|}{|A_{11}|\cdot|A_{22}|}.$$

It was shown in Section 9.3 that when the null hypothesis is true, this
criterion is distributed as $U_{p_1,p_2,N-1-p_2}$, the criterion for testing a hypothesis
about regression coefficients (Chapter 8). We now wish to study further the
relationship between testing the hypothesis of independence of two sets and
testing the hypothesis that regression of one set on the other is zero.


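The conditional distribution of $X_\alpha^{(1)}$ given $X_\alpha^{(2)} = x_\alpha^{(2)}$ is
$N[\mu^{(1)} + \beta(x_\alpha^{(2)} - \mu^{(2)}),\ \Sigma_{11\cdot2}] = N[\beta(x_\alpha^{(2)} - \bar x^{(2)}) + \nu,\ \Sigma_{11\cdot2}]$, where $\beta = \Sigma_{12}\Sigma_{22}^{-1}$,
$\Sigma_{11\cdot2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$, and $\nu = \mu^{(1)} + \beta(\bar x^{(2)} - \mu^{(2)})$. Let $X_\alpha^* = X_\alpha^{(1)}$,
$z_\alpha^{*\prime} = [(x_\alpha^{(2)} - \bar x^{(2)})'\ \ 1]$, $\beta^* = (\beta\ \ \nu)$, and $\Sigma^* = \Sigma_{11\cdot2}$. Then the conditional
distribution of $X_\alpha^*$ is $N(\beta^* z_\alpha^*, \Sigma^*)$. This is exactly the distribution studied in
Chapter 8.

The null hypothesis that $\Sigma_{12} = 0$ is equivalent to the null hypothesis $\beta = 0$.
Considering $x_\alpha^{(2)}$ fixed, we know from Chapter 8 that the criterion (based on
the likelihood ratio criterion) for testing this hypothesis is

(4) $$U = \frac{\left|\sum_\alpha (x_\alpha^* - \hat\beta_\Omega^* z_\alpha^*)(x_\alpha^* - \hat\beta_\Omega^* z_\alpha^*)'\right|}{\left|\sum_\alpha (x_\alpha^* - \hat\beta_{2\omega}^* z_\alpha^{*(2)})(x_\alpha^* - \hat\beta_{2\omega}^* z_\alpha^{*(2)})'\right|},$$

where

(5) $$z_\alpha^{*(2)} = 1,\qquad \hat\beta_{2\omega}^* = \hat\nu = \bar x^* = \bar x^{(1)},\qquad \hat\beta_\Omega^* = (\hat\beta_{1\Omega}\ \ \hat\beta_{2\Omega}) = (A_{12}A_{22}^{-1}\ \ \bar x^{(1)}).$$

The matrix in the denominator of U is

(6) $$\sum_{\alpha=1}^{N}(x_\alpha^{(1)} - \bar x^{(1)})(x_\alpha^{(1)} - \bar x^{(1)})' = A_{11}.$$

The matrix in the numerator is

(7) $$\sum_{\alpha=1}^{N}\left[x_\alpha^{(1)} - \bar x^{(1)} - A_{12}A_{22}^{-1}(x_\alpha^{(2)} - \bar x^{(2)})\right]\left[x_\alpha^{(1)} - \bar x^{(1)} - A_{12}A_{22}^{-1}(x_\alpha^{(2)} - \bar x^{(2)})\right]' = A_{11} - A_{12}A_{22}^{-1}A_{21}.$$

Therefore,

(8) $$U = \frac{|A_{11} - A_{12}A_{22}^{-1}A_{21}|}{|A_{11}|} = \frac{|A|}{|A_{11}|\cdot|A_{22}|},$$

which is exactly V.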
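The equality of the regression criterion U and the independence criterion V can be checked on a small numerical case. The sketch below uses a hypothetical $3 \times 3$ matrix A with $p_1 = 1$ and $p_2 = 2$ (so that $A_{11}$ is a scalar and $A_{22}$ has a closed-form inverse) and exact fractions; the helper `det` is a plain cofactor expansion written for this illustration.

```python
from fractions import Fraction

def det(m):
    """Determinant by cofactor expansion along the first row (small matrices)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for c, entry in enumerate(m[0]):
        minor = [row[:c] + row[c + 1:] for row in m[1:]]
        total += (-1) ** c * entry * det(minor)
    return total

# Hypothetical sum-of-products matrix, partitioned with p_1 = 1, p_2 = 2
A = [[Fraction(4), Fraction(2), Fraction(1)],
     [Fraction(2), Fraction(5), Fraction(3)],
     [Fraction(1), Fraction(3), Fraction(6)]]
A11 = [[A[0][0]]]
A22 = [row[1:] for row in A[1:]]
a12 = A[0][1:]

# V = |A| / (|A11| |A22|)
V = det(A) / (det(A11) * det(A22))

# U = |A11 - A12 A22^{-1} A21| / |A11|, with the explicit 2 x 2 inverse
d = det(A22)
A22_inv = [[A22[1][1] / d, -A22[0][1] / d],
           [-A22[1][0] / d, A22[0][0] / d]]
quad = sum(a12[r] * A22_inv[r][c] * a12[c] for r in range(2) for c in range(2))
U = (A[0][0] - quad) / A[0][0]
```

Both routes give the same rational number, as the determinant identity $|A| = |A_{22}|\cdot|A_{11} - A_{12}A_{22}^{-1}A_{21}|$ requires.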

Now let us see why it is that when the null hypothesis is true the
distribution of $U = V$ does not depend on whether the $X_\alpha^{(2)}$ are held fixed. It
was shown in Chapter 8 that when the null hypothesis is true the distribution
of U depends only on p, $q_1$, and $N - q_2$, not on $z_\alpha$. Thus the conditional
distribution of V given $X_\alpha^{(2)} = x_\alpha^{(2)}$ does not depend on $x_\alpha^{(2)}$; the joint
distribution of V and $X_\alpha^{(2)}$ is the product of the distribution of V and the distribution
of $X_\alpha^{(2)}$, and the marginal distribution of V is this conditional distribution.
This shows that the distribution of V (under the null hypothesis) does not
depend on whether the $X_\alpha^{(2)}$ are fixed or have any distribution (normal or
not).

We can extend this result to show that if $q > 2$, the distribution of V
under the null hypothesis of independence does not depend on the
distribution of one set of variates, say $X_\alpha^{(1)}$. We have $V = V_2 \cdots V_q$, where $V_i$ is
defined in (1) of Section 9.3. When the null hypothesis is true, $V_q$ is
distributed independently of $X_\alpha^{(1)}, \ldots, X_\alpha^{(q-1)}$ by the previous result. In turn
we argue that $V_j$ is distributed independently of $X_\alpha^{(1)}, \ldots, X_\alpha^{(j-1)}$. Thus
$V_2 \cdots V_q$ is distributed independently of $X_\alpha^{(1)}$.

Theorem 9.8.1. Under the null hypothesis of independence, the distribution
of V is that given earlier in this chapter if $q - 1$ sets are jointly normally
distributed, even though one set is not normally distributed.

In the case of two sets of variates, we may be interested in a measure of
association between the two sets which is a generalization of the correlation
coefficient. The square of the correlation between two scalars $X_1$ and $X_2$
can be considered as the ratio of the variance of the regression of $X_1$ on $X_2$
to the variance of $X_1$; this is $\mathcal{V}(\beta X_2)/\mathcal{V}(X_1) = \beta^2\sigma_{22}/\sigma_{11} = (\sigma_{12}^2/\sigma_{22})/\sigma_{11}
= \rho_{12}^2$. A corresponding measure for vectors $X^{(1)}$ and $X^{(2)}$ is the ratio of the
generalized variance of the regression of $X^{(1)}$ on $X^{(2)}$ to the generalized
variance of $X^{(1)}$, namely,

(9) $$\frac{|\mathcal{E}\beta X^{(2)}(\beta X^{(2)})'|}{|\Sigma_{11}|} = \frac{|\beta\Sigma_{22}\beta'|}{|\Sigma_{11}|} = \frac{|\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}|}{|\Sigma_{11}|} = (-1)^{p_1}\frac{\begin{vmatrix}0 & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{vmatrix}}{|\Sigma_{11}|\cdot|\Sigma_{22}|}.$$

If $p_1 = p_2$, the measure is

(10) $$\frac{|\Sigma_{12}|^2}{|\Sigma_{11}|\cdot|\Sigma_{22}|}.$$

In a sense this measure shows how well $X^{(1)}$ can be predicted from $X^{(2)}$.

In the case of two scalar variables $X_1$ and $X_2$ the coefficient of alienation
is $\sigma_{1\cdot2}^2/\sigma_1^2$, where $\sigma_{1\cdot2}^2 = \mathcal{E}(X_1 - \beta X_2)^2$ is the variance of $X_1$ about its
regression on $X_2$ when $\mathcal{E}X_1 = \mathcal{E}X_2 = 0$ and $\mathcal{E}(X_1|X_2) = \beta X_2$. In the case of
two vectors $X^{(1)}$ and $X^{(2)}$, the regression matrix is $\beta = \Sigma_{12}\Sigma_{22}^{-1}$, and the
generalized variance of $X^{(1)}$ about its regression on $X^{(2)}$ is

(11) $$|\mathcal{E}(X^{(1)} - \beta X^{(2)})(X^{(1)} - \beta X^{(2)})'| = |\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}| = \frac{|\Sigma|}{|\Sigma_{22}|}.$$

Since the generalized variance of $X^{(1)}$ is $|\mathcal{E}X^{(1)}X^{(1)\prime}| = |\Sigma_{11}|$, the vector
coefficient of alienation is

(12) $$\frac{|\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}|}{|\Sigma_{11}|} = \frac{|\Sigma|}{|\Sigma_{11}|\cdot|\Sigma_{22}|}.$$

The sample equivalent of (12) is simply V.

A measure of association is 1 minus the coefficient of alienation. Either of
these two measures of association can be modified to take account of the
number of components. In the first case, one can take the $p_1$th root of (9); in
the second case, one can subtract the $p_1$th root of the coefficient of
alienation from 1. Another measure of association is

(13) $$\frac{\mathrm{tr}\,\mathcal{E}[\beta X^{(2)}(\beta X^{(2)})'](\mathcal{E}X^{(1)}X^{(1)\prime})^{-1}}{p_1} = \frac{\mathrm{tr}\,\beta\Sigma_{22}\beta'\Sigma_{11}^{-1}}{p_1} = \frac{\mathrm{tr}\,\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1}}{p_1}.$$

This measure of association ranges between 0 and 1. If $X^{(1)}$ can be predicted
exactly from $X^{(2)}$ for $p_1 \le p_2$ (i.e., $\Sigma_{11\cdot2} = 0$), then this measure is 1. If no
linear combination of $X^{(1)}$ can be predicted exactly, this measure is 0.
9.9. ADMISSIBILITY OF THE LIKELIHOOD RATIO TEST

The admissibility of the likelihood ratio test in the case of the 0-1 loss 
function can be proved by showing that it is the Bayes procedure with respect 
to an appropriate a priori distribution of the parameters. (See Section 5.6.) 

Theorem 9.9.1. The likelihood ratio test of the hypothesis that $\Sigma$ is of the
form (6) of Section 9.2 is Bayes and admissible if $N > p + 1$.

Proof. We shall show that the likelihood ratio test is equivalent to
rejection of the hypothesis when

(1) $$\frac{\int f(x|\theta)\,\Pi_1(d\theta)}{\int f(x|\theta)\,\Pi_0(d\theta)} \ge c,$$

where x represents the sample, $\theta$ represents the parameters ($\mu$ and $\Sigma$),
$f(x|\theta)$ is the density, and $\Pi_1$ and $\Pi_0$ are proportional to probability
measures of $\theta$ under the alternative and null hypotheses, respectively.
Specifically, the left-hand side is to be proportional to the square root of
$\prod_{i=1}^{q}|A_{ii}| / |A|$.

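To define $\Pi_1$, let

(2) $$\mu = (I + VV')^{-1}VY,\qquad \Sigma = (I + VV')^{-1},$$

where the p-component random vector V has the density proportional to
$(1 + v'v)^{-\frac12 n}$, $n = N - 1$, and the conditional distribution of Y given $V = v$ is
$N[0, (1 + v'v)/N]$. Note that the integral of $(1 + v'v)^{-\frac12 n}$ is finite if $n > p$
(Problem 5.15). The numerator of (1) is then

(3) $$\begin{aligned}\text{const}\int_{-\infty}^{\infty}\!\!\cdots\!\int_{-\infty}^{\infty} &|I + vv'|^{\frac12 N}\exp\Big\{-\tfrac12\sum_{\alpha=1}^{N}\big[x_\alpha - (I + vv')^{-1}vy\big]'(I + vv')\big[x_\alpha - (I + vv')^{-1}vy\big]\Big\}\\ &\cdot(1 + v'v)^{-\frac12 n}\cdot(1 + v'v)^{-\frac12}\exp\Big\{-\tfrac12\frac{Ny^2}{1 + v'v}\Big\}\,dv\,dy.\end{aligned}$$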
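The exponent in the integrand of (3) is $-\tfrac12$ times

(4) $$\begin{aligned}&\sum_{\alpha=1}^{N}x_\alpha'(I + vv')x_\alpha - 2yv'\sum_{\alpha=1}^{N}x_\alpha + Ny^2v'(I + vv')^{-1}v + \frac{Ny^2}{1 + v'v}\\ &\quad= \sum_{\alpha=1}^{N}x_\alpha'x_\alpha + v'\sum_{\alpha=1}^{N}x_\alpha x_\alpha'v - 2yv'N\bar x + Ny^2\\ &\quad= \mathrm{tr}\,A + v'Av + N\bar x'\bar x + N(y - \bar x'v)^2,\end{aligned}$$

where $A = \sum_{\alpha=1}^{N}x_\alpha x_\alpha' - N\bar x\bar x'$. We have used $v'(I + vv')^{-1}v + (1 + v'v)^{-1} = 1$
[from $(I + vv')^{-1} = I - (1 + v'v)^{-1}vv'$]. Using $|I + vv'| = 1 + v'v$ (Corollary
A.3.1), we write (3) as

(5) $$\text{const}\ e^{-\frac12\mathrm{tr}A - \frac12 N\bar x'\bar x}\int_{-\infty}^{\infty}\!\!\cdots\!\int_{-\infty}^{\infty}e^{-\frac12 v'Av}\,dv = \text{const}\,|A|^{-\frac12}e^{-\frac12\mathrm{tr}A - \frac12 N\bar x'\bar x}.$$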

To define $\Pi_0$, let $\Sigma$ have the form of (6) of Section 9.2. Let

(6) $$[\mu^{(i)}, \Sigma_{ii}] = \big[(I + V^{(i)}V^{(i)\prime})^{-1}V^{(i)}Y_i,\ (I + V^{(i)}V^{(i)\prime})^{-1}\big],\qquad i = 1, \ldots, q,$$

where the $p_i$-component random vector $V^{(i)}$ has density proportional to
$(1 + v^{(i)\prime}v^{(i)})^{-\frac12 n}$, and the conditional distribution of $Y_i$ given $V^{(i)} = v^{(i)}$ is
$N[0, (1 + v^{(i)\prime}v^{(i)})/N]$, and let $(V^{(1)}, Y_1), \ldots, (V^{(q)}, Y_q)$ be mutually independent.

Then the denominator of (1) is

(7) $$\prod_{i=1}^{q}\text{const}\,|A_{ii}|^{-\frac12}\exp\big[-\tfrac12(\mathrm{tr}\,A_{ii} + N\bar x^{(i)\prime}\bar x^{(i)})\big] = \text{const}\Big(\prod_{i=1}^{q}|A_{ii}|^{-\frac12}\Big)\exp\big[-\tfrac12(\mathrm{tr}\,A + N\bar x'\bar x)\big].$$

The left-hand side of (1) is then proportional to the square root of
$\prod_{i=1}^{q}|A_{ii}| / |A|$. ■

This proof has been adapted from that of Kiefer and Schwartz (1965).


9.10. MONOTONICITY OF POWER FUNCTIONS OF TESTS OF 
INDEPENDENCE OF SETS 

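Let $Z_\alpha = [Z_\alpha^{(1)\prime}, Z_\alpha^{(2)\prime}]'$, $\alpha = 1, \ldots, n$, be distributed according to

(1) $$N\left[\begin{pmatrix}0\\ 0\end{pmatrix}, \begin{pmatrix}\Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{pmatrix}\right].$$

We want to test $H: \Sigma_{12} = 0$. We suppose $p_1 \le p_2$ without loss of generality.
Let $\rho_1, \ldots, \rho_{p_1}$ ($\rho_1 \ge \cdots \ge \rho_{p_1}$) be the (population) canonical correlation
coefficients. (The $\rho_i^2$'s are the characteristic roots of $\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$;
see Chapter 12.) Let $R = \mathrm{diag}(\rho_1, \ldots, \rho_{p_1})$ and $\Delta = [R, 0]$ ($p_1 \times p_2$).

Lemma 9.10.1. There exist matrices $B_1$ ($p_1 \times p_1$), $B_2$ ($p_2 \times p_2$) such that

(2) $$B_1\Sigma_{11}B_1' = I_{p_1},\qquad B_2\Sigma_{22}B_2' = I_{p_2},\qquad B_1\Sigma_{12}B_2' = \Delta.$$

Proof. Let $m = p_2$, $B = B_1$, $F' = \Sigma_{22}^{1/2}B_2'$, $S = \Sigma_{12}\Sigma_{22}^{-1/2}$ in Lemma 8.10.13.
Then $F'F = I_{p_2}$, $B_1\Sigma_{12}B_2' = BSF' = \Delta$. ■

(This lemma is also contained in Section 12.2.)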

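Let $x_\alpha = B_1z_\alpha^{(1)}$, $y_\alpha = B_2z_\alpha^{(2)}$, $\alpha = 1, \ldots, n$, and $X = (x_1, \ldots, x_n)$, $Y =
(y_1, \ldots, y_n)$. Then $(x_\alpha', y_\alpha')'$, $\alpha = 1, \ldots, n$, are independently distributed
according to

(3) $$N\left[\begin{pmatrix}0\\ 0\end{pmatrix}, \begin{pmatrix}I_{p_1} & \Delta\\ \Delta' & I_{p_2}\end{pmatrix}\right].$$

The hypothesis $H: \Sigma_{12} = 0$ is equivalent to $H: \Delta = 0$ (i.e., all the canonical
correlation coefficients $\rho_1, \ldots, \rho_{p_1}$ are zero). Now given Y, the vectors $x_\alpha$,
$\alpha = 1, \ldots, n$, are conditionally independently distributed according to
$N(\Delta y_\alpha, I - \Delta\Delta') = N(\Delta y_\alpha, I - R^2)$. Then $x_\alpha^* = (I_{p_1} - R^2)^{-1/2}x_\alpha$ is distributed
according to $N(My_\alpha, I_{p_1})$, where

(4) $$M = (D, 0),\qquad D = \mathrm{diag}(\delta_1, \ldots, \delta_{p_1}),\qquad \delta_i = \rho_i/(1 - \rho_i^2)^{1/2},\quad i = 1, \ldots, p_1.$$

Note that $\delta_i^2$ is a characteristic root of $\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11\cdot2}^{-1}$, where $\Sigma_{11\cdot2} = \Sigma_{11}
- \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$.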

Invariant tests depend only on the (sample) canonical correlation
coefficients $r_i = \sqrt{c_i}$, where

(5) $$c_i = \lambda_i\big[(X^*X^{*\prime})^{-1}(X^*Y')(YY')^{-1}(YX^{*\prime})\big].$$

Let

(6) $$S_h = X^*Y'(YY')^{-1}YX^{*\prime},\qquad S_e = X^*X^{*\prime} - S_h.$$

Then

(7) $$\lambda_i(S_hS_e^{-1}) = \frac{c_i}{1 - c_i}.$$

Now given Y, the problem reduces to the MANOVA problem and we can
apply Theorem 8.10.6 as follows. There is an orthogonal transformation
(Section 8.3.3) that carries $X^*$ to $(U, V)$ such that $S_h = UU'$, $S_e = VV'$,
$U = (u_1, \ldots, u_{p_2})$, V is $p_1 \times (n - p_2)$, $u_i$ has the distribution $N(\delta_i\epsilon_i, I)$,
$i = 1, \ldots, p_1$ ($\epsilon_i$ being the ith column of I), and $N(0, I)$, $i = p_1 + 1, \ldots, p_2$,
and the columns of V are independently distributed according to $N(0, I)$.
Then $c_1, \ldots, c_{p_1}$ are the characteristic roots of $UU'(UU' + VV')^{-1}$, and their
distribution depends on the characteristic roots of $MYY'M'$, say, $\tau_1^2, \ldots, \tau_{p_1}^2$. Now
from Theorem 8.10.6, we obtain the following lemma.

Lemma 9.10.2. If the acceptance region of an invariant test is convex in
each column of U, given V and the other columns of U, then the conditional
power given Y increases in each characteristic root $\tau_i^2$ of $MYY'M'$.

Lemma 9.10.3. If $A \ge B$, then $\lambda_i(A) \ge \lambda_i(B)$.

Proof. By the minimax property of the characteristic roots [see, e.g.,
Courant and Hilbert (1953)],

(8) $$\lambda_i(A) = \max_{S_i}\min_{x\in S_i}\frac{x'Ax}{x'x} \ge \max_{S_i}\min_{x\in S_i}\frac{x'Bx}{x'x} = \lambda_i(B),$$

where $S_i$ ranges over i-dimensional subspaces. ■

Now Lemma 9.10.3 applied to $MYY'M'$ shows that for every j, $\tau_j^2$ is an
increasing function of $\delta_i = \rho_i/(1 - \rho_i^2)^{1/2}$ and hence of $\rho_i$. Since the marginal
distribution of Y does not depend on the $\rho_i$'s, by taking the unconditional
power we obtain the following theorem.

Theorem 9.10.1. An invariant test for which the acceptance region is convex
in each column of U for each set of fixed V and other columns of U has a power
function that is monotonically increasing in each $\rho_i$.

9.11. ELLIPTICALLY CONTOURED DISTRIBUTIONS 
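9.11.1. Observations Elliptically Contoured

Let $x_1, \ldots, x_N$ be N observations on a random vector X with density

(1) $$|\Lambda|^{-\frac12}g[(x - \nu)'\Lambda^{-1}(x - \nu)],$$

where $\mathcal{E}R^4 < \infty$ and $R^2 = (X - \nu)'\Lambda^{-1}(X - \nu)$. Then $\mathcal{E}X = \nu$ and
$\mathcal{E}(X - \nu)(X - \nu)' = \Sigma = (\mathcal{E}R^2/p)\Lambda$. Let

(2) $$\bar x = \frac{1}{N}\sum_{\alpha=1}^{N}x_\alpha,\qquad S = \frac{1}{N-1}\sum_{\alpha=1}^{N}(x_\alpha - \bar x)(x_\alpha - \bar x)'.$$

Then

(3) $$\sqrt{N}\,\mathrm{vec}(S - \Sigma) \xrightarrow{d} N\big[0,\ (\kappa + 1)(I_{p^2} + K_{pp})(\Sigma\otimes\Sigma) + \kappa\,\mathrm{vec}\,\Sigma\,(\mathrm{vec}\,\Sigma)'\big],$$

where $1 + \kappa = p\,\mathcal{E}R^4/[(p + 2)(\mathcal{E}R^2)^2]$.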

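The likelihood ratio criterion for testing the null hypothesis $\Sigma_{ij} = 0$, $i \ne j$,
is the N/2th power of $U = \prod_{i=2}^{q}V_i$, where $V_i$ is the U-criterion for testing the
null hypothesis $\Sigma_{i1} = 0, \ldots, \Sigma_{i,i-1} = 0$ and is given by (1) and (6) of Section
9.3. The form of $V_i$ is that of the likelihood ratio criterion U of Chapter 8
with X replaced by $X^{(i)}$, $\beta$ by $\beta_i$ given by (5) of Section 9.3, Z by

(4) $$X^{(i-1)} = \begin{pmatrix}X^{(1)}\\ \vdots\\ X^{(i-1)}\end{pmatrix},$$

and $\Sigma$ by $\Sigma_{ii}$ under the null hypothesis $\beta_i = 0$. The subvector $X^{(i-1)}$ is
uncorrelated with $X^{(i)}$, but not independent of $X^{(i)}$ unless $(X^{(i-1)\prime}, X^{(i)\prime})'$ is
normal. Let

(5) $$S^{(i-1)} = \begin{pmatrix}S_{11} & \cdots & S_{1,i-1}\\ \vdots & & \vdots\\ S_{i-1,1} & \cdots & S_{i-1,i-1}\end{pmatrix},$$

(6) $$S^{(i,i-1)} = (S_{i1}, \ldots, S_{i,i-1}) = S^{(i-1,i)\prime},$$

with similar definitions of $\Sigma^{(i-1)}$, $\Sigma^{(i,i-1)}$, $A^{(i-1)}$, and $A^{(i,i-1)}$. We write
$V_i = |G_i| / |G_i + H_i|$, where

(7) $$H_i = A^{(i,i-1)}(A^{(i-1)})^{-1}A^{(i-1,i)} = (N - 1)S^{(i,i-1)}(S^{(i-1)})^{-1}S^{(i-1,i)},$$

(8) $$G_i = A_{ii} - H_i = (N - 1)S_{ii} - H_i.$$

Theorem 9.11.1. When X has the density (1) and the null hypothesis is true,
the limiting distribution of $H_i$ is $W[(1 + \kappa)\Sigma_{ii}, \bar p_i]$, where $\bar p_i = p_1 + \cdots + p_{i-1}$
and $p_i$ is the number of components of $X^{(i)}$.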

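Proof. Since $\Sigma^{(i,i-1)} = 0$, we have $\mathcal{E}S^{(i,i-1)} = 0$ and

(9) $$\mathcal{E}s_{jk}s_{lm} = \frac{\kappa + 1}{n}\sigma_{jl}\sigma_{km}$$

if $j, l \le \bar p_i$ and $k, m > \bar p_i$ or if $j, l > \bar p_i$ and $k, m \le \bar p_i$, and $\mathcal{E}s_{jk}s_{lm} = 0$
otherwise (Theorem 3.6.1). We can write

(10) $$\mathcal{E}\,\mathrm{vec}\,S^{(i,i-1)}(\mathrm{vec}\,S^{(i,i-1)})' = \frac{\kappa + 1}{n}(\Sigma_{ii}\otimes\Sigma^{(i-1)}).$$

Since $S^{(i-1)} \to_p \Sigma^{(i-1)}$ and $\sqrt{n}\,\mathrm{vec}\,S^{(i,i-1)}$ has a limiting normal distribution,
Theorem 9.11.1 follows by (2) of Section 8.4. ■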

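Theorem 9.11.2. Under the conditions of Theorem 9.11.1 when the null
hypothesis is true

(11) $$-N\log V_i \xrightarrow{d} (1 + \kappa)\chi^2_{\bar p_ip_i}.$$

Proof. We can write $V_i = |I + N^{-1}(\tfrac{1}{N}G_i)^{-1}H_i|^{-1}$ and use $N\log|I + N^{-1}C| =
\mathrm{tr}\,C + O_p(N^{-1})$ and

(12) $$\mathrm{tr}\big(\tfrac{1}{N}G_i\big)^{-1}H_i = N(\mathrm{vec}\,S^{(i,i-1)})'\big[\big(\tfrac{1}{N}G_i\big)^{-1}\otimes(S^{(i-1)})^{-1}\big]\mathrm{vec}\,S^{(i,i-1)}.$$ ■

Because $X^{(i)}$ is uncorrelated with $X^{(i-1)}$ when the null hypothesis
$H_i: \Sigma^{(i,i-1)} = 0$ is true, $V_i$ is asymptotically independent of $V_2, \ldots, V_{i-1}$. When the
null hypotheses $H_2, \ldots, H_i$ are true, $V_i$ is asymptotically independent of
$V_2, \ldots, V_{i-1}$. It follows from Theorem 9.11.2 that

(13) $$-N\log V = -N\sum_{i=2}^{q}\log V_i \xrightarrow{d} (1 + \kappa)\chi_f^2,$$

where $f = \sum_{i=2}^{q}\bar p_ip_i = \tfrac12[p(p + 1) - \sum_{i=1}^{q}p_i(p_i + 1)]$. The likelihood ratio test
of Section 9.2 can be carried out on an asymptotic basis.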
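The kurtosis parameter $\kappa$ enters the asymptotic test only through the factor $1 + \kappa$, so a statistic such as $-N\log V$ is divided by $1 + \kappa$ before it is referred to $\chi_f^2$. The sketch below evaluates $\kappa$ from the moments of $R^2$ using the definition that follows (3). The function names are illustrative, and the multivariate t case is an added example (not from the text) that relies on the standard moments of $R^2$ for that family.

```python
from fractions import Fraction

def kurtosis_parameter(p, ER2, ER4):
    """kappa from 1 + kappa = p E R^4 / [(p + 2)(E R^2)^2]."""
    return Fraction(p) * ER4 / ((p + 2) * ER2**2) - 1

def rescaled_statistic(minus_N_log_V, kappa):
    """Divide -N log V by (1 + kappa) so it can be referred to chi-square_f."""
    return minus_N_log_V / (1 + kappa)

# Normal case: R^2 is chi-square with p d.f., so E R^2 = p and E R^4 = p(p + 2).
p = 3
kappa_normal = kurtosis_parameter(p, Fraction(p), Fraction(p * (p + 2)))

# Multivariate t with nu d.f. and scale matrix Lambda (illustrative assumption):
# E R^2 = p nu / (nu - 2), E R^4 = p (p + 2) nu^2 / ((nu - 2)(nu - 4)).
nu = 10
kappa_t = kurtosis_parameter(
    p,
    Fraction(p * nu, nu - 2),
    Fraction(p * (p + 2) * nu**2, (nu - 2) * (nu - 4)),
)
```

In practice $\kappa$ would be replaced by a consistent estimate; the exact moments are used here only to show the size of the correction.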
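Let $A_0 = \mathrm{diag}(A_{11}, \ldots, A_{qq})$. Then

(14) $$\tfrac12\,\mathrm{tr}(AA_0^{-1} - I)^2 = \tfrac12\sum_{\substack{i,j=1\\ i\ne j}}^{q}\mathrm{tr}\,A_{ij}A_{jj}^{-1}A_{ji}A_{ii}^{-1},$$

multiplied by N, has the limiting $(1 + \kappa)\chi_f^2$-distribution when
$\Sigma = \mathrm{diag}(\Sigma_{11}, \ldots, \Sigma_{qq})$. The step-down procedure of Section 9.6.1 is also
justified on an asymptotic basis.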

9.11.2. Elliptically Contoured Matrix Distributions

Let Y ($p \times N$) have the density $g(\mathrm{tr}\,YY')$. The matrix Y is vector-spherical;
that is, vec Y is spherical and has the stochastic representation $\mathrm{vec}\,Y =
R\,\mathrm{vec}\,U_{p\times N}$, where $R^2 = (\mathrm{vec}\,Y)'\mathrm{vec}\,Y = \mathrm{tr}\,YY'$ and $\mathrm{vec}\,U_{p\times N}$ has the uniform
distribution on the unit sphere $(\mathrm{vec}\,U_{p\times N})'\mathrm{vec}\,U_{p\times N} = 1$. (We use the
notation $U_{p\times N}$ to distinguish from U uniform on the space $UU' = I_p$.)
Let

(15) $$X = \nu\epsilon_N' + CY,$$

where $\Lambda = CC'$ and C is lower triangular. Then X has the density

(16) $$|\Lambda|^{-N/2}g\big[\mathrm{tr}\,C^{-1}(X - \nu\epsilon_N')(X' - \epsilon_N\nu')(C')^{-1}\big] = |\Lambda|^{-N/2}g\big[\mathrm{tr}(X' - \epsilon_N\nu')\Lambda^{-1}(X - \nu\epsilon_N')\big].$$

Consider the null hypothesis $\Sigma_{ij} = 0$, $i \ne j$, or alternatively $\Lambda_{ij} = 0$, $i \ne j$, or
alternatively, $R_{ij} = 0$, $i \ne j$. Then $C = \mathrm{diag}(C_{11}, \ldots, C_{qq})$.

Let $M = I_N - (1/N)\epsilon_N\epsilon_N'$; since $M^2 = M$, M is an idempotent matrix
with $N - 1$ characteristic roots 1 and one root 0. Then $A = XMX'$ and
$A_{ii} = X^{(i)}MX^{(i)\prime}$. The likelihood function is

(17) $$|\Lambda|^{-N/2}g\{\mathrm{tr}\,\Lambda^{-1}[A + N(\bar x - \nu)(\bar x - \nu)']\}.$$

The matrix A and the vector $\bar x$ are sufficient statistics, and the likelihood
ratio criterion for the hypothesis H is $(|A| / \prod_{i=1}^{q}|A_{ii}|)^{N/2}$, the same as for
normality. See Anderson and Fang (1990b).

Theorem 9.11.3. Let $f(X)$ be a vector-valued function of X ($p \times N$) such
that

(18) $$f(X + \nu\epsilon_N') = f(X)$$

for all $\nu$ and

(19) $$f(KX) = f(X)$$

for all $K = \mathrm{diag}(K_{11}, \ldots, K_{qq})$. Then the distribution of $f(X)$, where X has the
arbitrary density (16), is the same as the distribution of $f(X)$, where X has the
normal density (16).


Proof. The proof is similar to the proof of Theorem 4.5.4. ■


It follows from Theorem 9.11.3 that V has the same distribution under the
null hypothesis H when X has the density (16) and for X normally
distributed since V is invariant under the transformation $X \to KX$. Similarly, $V_i$
and the criterion (14) are invariant, and hence have the distribution under
normality.


PROBLEMS 

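9.1. (Sec. 9.3) Prove

$$\mathcal{E}V^h = \frac{\prod_{i=1}^{p}\Gamma[\tfrac12(n + 1 - i) + h]\ \prod_{i=1}^{q}\big\{\prod_{j=1}^{p_i}\Gamma[\tfrac12(n + 1 - j)]\big\}}{\prod_{i=1}^{p}\Gamma[\tfrac12(n + 1 - i)]\ \prod_{i=1}^{q}\big\{\prod_{j=1}^{p_i}\Gamma[\tfrac12(n + 1 - j) + h]\big\}}$$

by integration of $V^hw(A|\Sigma_0, n)$. Hint: Show

$$\mathcal{E}V^h = \frac{K(\Sigma_0, n)}{K(\Sigma_0, n + 2h)}\int\!\cdots\!\int\prod_{i=1}^{q}|A_{ii}|^{-h}\,w(A|\Sigma_0, n + 2h)\,dA,$$

where $K(\Sigma, n)$ is defined by $w(A|\Sigma, n) = K(\Sigma, n)|A|^{\frac12(n-p-1)}e^{-\frac12\mathrm{tr}\,\Sigma^{-1}A}$. Use
Theorem 7.3.5 to show

$$\mathcal{E}V^h = \frac{K(\Sigma_0, n)}{K(\Sigma_0, n + 2h)}\prod_{i=1}^{q}\left\{\frac{K(\Sigma_{ii}, n + 2h)}{K(\Sigma_{ii}, n)}\int\!\cdots\!\int w(A_{ii}|\Sigma_{ii}, n)\,dA_{ii}\right\}.$$

9.2. (Sec. 9.3) Prove that if $p_1 = p_2 = p_3 = 1$ [Wilks (1935)]

$$\Pr\{V \le v\} = I_v[\tfrac12(n - 1), \tfrac12] + 2B^{-1}[\tfrac12(n - 1), \tfrac12]\,v^{\frac12(n-2)}\sin^{-1}\sqrt{1 - v}.$$

[Hint: Use Theorem 9.3.3 and $\Pr\{V \le v\} = 1 - \Pr\{v < V\}$.]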
9.3. (Sec. 9.3) Prove that if $p_1 = p_2 = p_3 = 2$ [Wilks (1935)]

$$\begin{aligned}\Pr\{V \le v\} &= I_{\sqrt v}(n - 5, 4)\\ &\quad + B^{-1}(n - 5, 4)\,v^{\frac12(n-5)}\Big\{\tfrac{n}{6} - \tfrac32(n - 1)\sqrt v - \tfrac32(n - 4)v + \big(\tfrac{17}{6}n - \tfrac{15}{2}\big)v^{3/2}\\ &\qquad - \tfrac32(n - 2)v\log v - \tfrac12(n - 3)v^{3/2}\log v\Big\}.\end{aligned}$$

[Hint: Use (9).]

9.4. (Sec. 9.3) Derive some of the distributions obtained by Wilks (1935) and
referred to at the end of Section 9.3.3. [Hint: In addition to the results for
Problems 9.2 and 9.3, use those of Section 9.3.2.]

9.5. (Sec. 9.4) For the case $p_i = 2$, express k and $\gamma_2$. Compute the second term of
(6) when v is chosen so that the first term is 0.95 for $p = 4$ and 6 and $N = 15$.

9.6. (Sec. 9.5) Prove that if $BAB' = CAC' = I$ for A positive definite and B and C
nonsingular then $B = QC$ where Q is orthogonal.

9.7. (Sec. 9.5) Prove N times (2) has a limiting $\chi^2$-distribution with f degrees of
freedom under the null hypothesis.

9.8. (Sec. 9.8) Give the sample vector coefficient of alienation and the vector
correlation coefficient.

9.9. (Sec. 9.8) If y is the sample vector coefficient of alienation and z the square
of the vector correlation coefficient, find $\mathcal{E}y^gz^h$ when $\Sigma_{12} = 0$.

9.10. (Sec. 9.9) Prove

$$\int_{-\infty}^{\infty}\!\!\cdots\!\int_{-\infty}^{\infty}\Big(1 + \sum_{i=1}^{p}u_i^2\Big)^{-\frac12 n}du_1\cdots du_p < \infty$$

if $p < n$. [Hint: Let $y_j = w_j \cdots$, $j = 1, \ldots, p$, in turn.]


9.11. Let $x_1$ = arithmetic speed, $x_2$ = arithmetic power, $x_3$ = intellectual interest,
$x_4$ = social interest, $x_5$ = activity interest. Kelley (1928) observed the following
correlations between batteries of tests identified as above, based on 109 pupils:

$$\begin{pmatrix} 1.0000 & 0.4249 & -0.0552 & -0.0031 & 0.1927\\ 0.4249 & 1.0000 & -0.0416 & 0.0495 & 0.0687\\ -0.0552 & -0.0416 & 1.0000 & 0.7474 & 0.1691\\ -0.0031 & 0.0495 & 0.7474 & 1.0000 & 0.2653\\ 0.1927 & 0.0687 & 0.1691 & 0.2653 & 1.0000 \end{pmatrix}$$

Let $x^{(1)\prime} = (x_1, x_2)$ and $x^{(2)\prime} = (x_3, x_4, x_5)$. Test the hypothesis that $x^{(1)}$ is
independent of $x^{(2)}$ at the 1% significance level.

9.12. Carry out the same exercise on the data in Problem 3.42.

9.13. Another set of time-study data [Abruzzi (1950)] is summarized by the
correlation matrix based on 188 observations:

$$\begin{pmatrix} 1.00 & -0.27 & 0.06 & 0.07 & 0.02\\ -0.27 & 1.00 & -0.01 & -0.02 & -0.02\\ 0.06 & -0.01 & 1.00 & -0.07 & -0.04\\ 0.07 & -0.02 & -0.07 & 1.00 & -0.10\\ 0.02 & -0.02 & -0.04 & -0.10 & 1.00 \end{pmatrix}$$

Test the hypothesis that $\sigma_{ij} = 0$, $i \ne j$, at the 5% significance level.



CHAPTER 10 


Testing Hypotheses of Equality 
of Covariance Matrices and 
Equality of Mean Vectors and 
Covariance Matrices 


10.1. INTRODUCTION 

In this chapter we study the problems of testing hypotheses of equality of
covariance matrices and equality of both covariance matrices and mean
vectors. In each case (except one) the problem and tests considered are
multivariate generalizations of a univariate problem and test. Many of the
tests are likelihood ratio tests or modifications of likelihood ratio tests.
Invariance considerations lead to other test procedures.

First, we consider equality of covariance matrices and equality of
covariance matrices and mean vectors of several populations without specifying the
common covariance matrix or the common covariance matrix and mean
vector. The multivariate analysis of variance with random factors is
considered in this context. Later we treat the equality of a covariance matrix to a
given matrix and also simultaneous equality of a covariance matrix to a given
matrix and equality of a mean vector to a given vector. One other hypothesis
considered, the equality of a covariance matrix to a given matrix except for a
proportionality factor, has only a trivial corresponding univariate hypothesis.

In each case the class of tests for a class of hypotheses leads to a
confidence region. Families of simultaneous confidence intervals for
covariances and for ratios of covariances are given.


The application of the tests for elliptically contoured distributions is
treated in Section 10.11.


An Introduction to Multivariate Statistical Analysis, Third Edition. By T. W. Anderson
ISBN 0-471-36091-0 Copyright © 2003 John Wiley & Sons, Inc.


10.2. CRITERIA FOR TESTING EQUALITY OF SEVERAL
COVARIANCE MATRICES


In this section we study several normal distributions and consider using a set
of samples, one from each population, to test the hypothesis that the
covariance matrices of these populations are equal. Let $x_\alpha^{(g)}$,
$\alpha = 1, \dots, N_g$, $g = 1, \dots, q$, be an observation from the $g$th
population $N(\mu^{(g)}, \Sigma_g)$. We wish to test the hypothesis

(1)    $H_1: \Sigma_1 = \cdots = \Sigma_q.$

Let $\sum_{g=1}^{q} N_g = N$,

(2)    $A_g = \sum_{\alpha=1}^{N_g} (x_\alpha^{(g)} - \bar{x}^{(g)})(x_\alpha^{(g)} - \bar{x}^{(g)})', \quad g = 1, \dots, q, \qquad A = \sum_{g=1}^{q} A_g.$


First we shall obtain the likelihood ratio criterion. The likelihood function is 


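(3)    $L = \prod_{g=1}^{q} \frac{1}{(2\pi)^{\frac{1}{2}pN_g} |\Sigma_g|^{\frac{1}{2}N_g}} \exp\left[ -\frac{1}{2} \sum_{\alpha=1}^{N_g} (x_\alpha^{(g)} - \mu^{(g)})' \Sigma_g^{-1} (x_\alpha^{(g)} - \mu^{(g)}) \right].$

The space $\Omega$ is the parameter space in which each $\Sigma_g$ is positive
definite and $\mu^{(g)}$ any vector. The space $\omega$ is the parameter space in
which $\Sigma_1 = \Sigma_2 = \cdots = \Sigma_q$ (positive definite) and $\mu^{(g)}$
is any vector. The maximum likelihood estimators of $\mu^{(g)}$ and $\Sigma_g$ in
$\Omega$ are given by

(4)    $\hat{\mu}_\Omega^{(g)} = \bar{x}^{(g)}, \qquad \hat{\Sigma}_{g\Omega} = \frac{1}{N_g} A_g, \qquad g = 1, \dots, q.$

The maximum likelihood estimators of $\mu^{(g)}$ in $\omega$ are given by (4),
$\hat{\mu}_\omega^{(g)} = \bar{x}^{(g)}$, since the maximizing values of
$\mu^{(g)}$ are the same regardless of $\Sigma_g$. The function to be maximized
with respect to $\Sigma_1 = \cdots = \Sigma_q = \Sigma$, say, is

(5)    $\frac{1}{(2\pi)^{\frac{1}{2}pN} |\Sigma|^{\frac{1}{2}N}} \exp\left[ -\frac{1}{2} \sum_{g=1}^{q} \sum_{\alpha=1}^{N_g} (x_\alpha^{(g)} - \bar{x}^{(g)})' \Sigma^{-1} (x_\alpha^{(g)} - \bar{x}^{(g)}) \right].$

By Lemma 3.2.2, the maximizing value of $\Sigma$ is

(6)    $\hat{\Sigma}_\omega = \frac{1}{N} A,$

and the maximum of the likelihood function is

(7)    $L(\hat{\mu}_\omega^{(1)}, \dots, \hat{\mu}_\omega^{(q)}, \hat{\Sigma}_\omega) = \frac{1}{(2\pi)^{\frac{1}{2}pN} |\hat{\Sigma}_\omega|^{\frac{1}{2}N}} e^{-\frac{1}{2}pN}.$

The likelihood ratio criterion for testing (1) is

(8)    $\lambda_1 = \frac{\prod_{g=1}^{q} |\hat{\Sigma}_{g\Omega}|^{\frac{1}{2}N_g}}{|\hat{\Sigma}_\omega|^{\frac{1}{2}N}} = \frac{\prod_{g=1}^{q} |A_g|^{\frac{1}{2}N_g}}{|A|^{\frac{1}{2}N}} \cdot \frac{N^{\frac{1}{2}pN}}{\prod_{g=1}^{q} N_g^{\frac{1}{2}pN_g}}.$

The critical region is

(9)    $\lambda_1 \le \lambda_1(\varepsilon),$

where $\lambda_1(\varepsilon)$ is defined so that (9) holds with probability $\varepsilon$ when (1) is true.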

Bartlett (1937a) has suggested modifying $\lambda_1$ in the univariate case by
replacing sample numbers by the numbers of degrees of freedom of the $A_g$.
Except for a numerical constant, the statistic he proposes is

(10)    $V_1 = \frac{\prod_{g=1}^{q} |A_g|^{\frac{1}{2}n_g}}{|A|^{\frac{1}{2}n}},$

where $n_g = N_g - 1$ and $n = \sum_{g=1}^{q} n_g = N - q$. The numerator is
proportional to a power of a weighted geometric mean of the sample generalized
variances, and the denominator is proportional to a power of the determinant of
a weighted arithmetic mean of the sample covariance matrices.
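
The two criteria (8) and (10) differ only in their exponents and in the constant factor of (8), which a short numerical sketch can make concrete. The function name `log_criteria` and its argument layout are illustrative assumptions, not notation from the text; the computation is done on the log scale because the determinants are raised to large powers.

```python
import numpy as np

def log_criteria(A_list, N_list):
    """Return (log lambda_1, log V_1) of (8) and (10).

    A_list: the matrices A_g of (2); N_list: the sample sizes N_g.
    (Hypothetical helper, written only to illustrate the formulas.)
    """
    A_list = [np.asarray(A, dtype=float) for A in A_list]
    N = np.asarray(N_list, dtype=float)
    n = N - 1.0                                   # degrees of freedom n_g = N_g - 1
    p = A_list[0].shape[0]
    logdet_g = np.array([np.linalg.slogdet(A)[1] for A in A_list])
    logdet_A = np.linalg.slogdet(sum(A_list))[1]  # log |A|, A = sum of the A_g

    log_V1 = 0.5 * np.sum(n * logdet_g) - 0.5 * n.sum() * logdet_A
    log_lam1 = (0.5 * np.sum(N * logdet_g) - 0.5 * N.sum() * logdet_A
                + 0.5 * p * N.sum() * np.log(N.sum())
                - 0.5 * p * np.sum(N * np.log(N)))
    return log_lam1, log_V1

# Two samples of size 5 whose matrices A_g coincide.
log_lam1, log_V1 = log_criteria([np.eye(2), np.eye(2)], [5, 5])
```

When the $A_g$ and the $N_g$ are all equal, the estimators in (4) and (6) agree, so $\lambda_1 = 1$, while $V_1$ retains the constant that (10) omits.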

In the scalar case ($p = 1$) of two samples the criterion (10) is

(11)    $V_1 = \frac{(n_1 s_1^2)^{\frac{1}{2}n_1} (n_2 s_2^2)^{\frac{1}{2}n_2}}{(n_1 s_1^2 + n_2 s_2^2)^{\frac{1}{2}(n_1 + n_2)}} = \frac{n_1^{\frac{1}{2}n_1} n_2^{\frac{1}{2}n_2} F^{\frac{1}{2}n_1}}{(n_1 F + n_2)^{\frac{1}{2}(n_1 + n_2)}},$

where $s_1^2$ and $s_2^2$ are the usual unbiased estimators of $\sigma_1^2$ and
$\sigma_2^2$ (the two population variances) and

(12)    $F = \frac{s_1^2}{s_2^2}.$

Thus the critical region

(13)    $V_1 \le V_1(\varepsilon)$

is based on the $F$-statistic with $n_1$ and $n_2$ degrees of freedom, and the
inequality (13) implies a particular method of choosing $F_1(\varepsilon)$ and
$F_2(\varepsilon)$ for the critical region

(14)    $F \le F_1(\varepsilon), \qquad F \ge F_2(\varepsilon).$

Brown (1939) and Scheffé (1942) have shown that (14) yields an unbiased
test.

Bartlett gave a more intuitive argument for the use of $V_1$ in place of $\lambda_1$.
He argues that if $N_1$, say, is small, $A_1$ is given too much weight in $\lambda_1$, and
other effects may be missed. Perlman (1980) has shown that the test based on
$V_1$ is unbiased.

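If one assumes

(15)    $\mathcal{E} X_\alpha^{(g)} = \beta_g z_\alpha^{(g)},$

where $z_\alpha^{(g)}$ consists of $k_g$ components, and if one estimates the matrix $\beta_g$,
defining

(16)    $A_g = \sum_{\alpha=1}^{N_g} (x_\alpha^{(g)} - \hat{\beta}_g z_\alpha^{(g)})(x_\alpha^{(g)} - \hat{\beta}_g z_\alpha^{(g)})',$

one uses (10) with $n_g = N_g - k_g$.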

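The statistical problem (parameter space $\Omega$ and null hypothesis $\omega$) is
invariant with respect to changes of location within populations and a
common linear transformation

(17)    $x^{*(g)} = C x^{(g)} + \nu^{(g)}, \qquad g = 1, \dots, q,$

where $C$ is nonsingular. Each matrix $A_g$ is invariant under change of
location, and the modified criterion (10) is invariant:

(18)    $V_1^* = \frac{\prod_{g=1}^{q} |A_g^*|^{\frac{1}{2}n_g}}{|A^*|^{\frac{1}{2}n}} = \frac{\prod_{g=1}^{q} |C A_g C'|^{\frac{1}{2}n_g}}{|C A C'|^{\frac{1}{2}n}} = \frac{\prod_{g=1}^{q} |A_g|^{\frac{1}{2}n_g}}{|A|^{\frac{1}{2}n}} = V_1.$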
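
The invariance in (18) rests on $|C A_g C'| = |C|^2 |A_g|$ together with $\sum_g n_g = n$, so the powers of $|C|$ cancel. A minimal numerical check, in which the helper `log_V1`, the random seed, and the simulated matrices are all illustrative assumptions:

```python
import numpy as np

def log_V1(A_list, n_list):
    """log of the modified criterion (10); hypothetical helper for illustration."""
    logdets = [np.linalg.slogdet(A)[1] for A in A_list]
    pooled = np.linalg.slogdet(sum(A_list))[1]
    return 0.5 * sum(n * ld for n, ld in zip(n_list, logdets)) - 0.5 * sum(n_list) * pooled

rng = np.random.default_rng(0)
p, n_list = 3, [6, 9, 12]

# Positive definite stand-ins for the A_g (cross-products of Gaussian rows).
A_list = []
for n in n_list:
    Z = rng.standard_normal((n, p))
    A_list.append(Z.T @ Z)

# A common nonsingular transformation C, applied as A_g -> C A_g C'.
C = rng.standard_normal((p, p)) + 3.0 * np.eye(p)
A_star = [C @ A @ C.T for A in A_list]

before = log_V1(A_list, n_list)
after = log_V1(A_star, n_list)   # agrees with `before` up to rounding error
```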


Similarly, the likelihood ratio criterion (8) is invariant.

An alternative invariant test procedure [Nagao (1973a)] is based on the
criterion

(19)    $\frac{1}{2} \sum_{g=1}^{q} n_g \operatorname{tr}(S_g S^{-1} - I)^2 = \frac{1}{2} \sum_{g=1}^{q} n_g \operatorname{tr}(S_g - S) S^{-1} (S_g - S) S^{-1},$

where $S_g = (1/n_g) A_g$ and $S = (1/n) A$. (See Section 7.8.)
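
The two expressions in (19) are the same quantity written two ways, since $(S_g - S) S^{-1} = S_g S^{-1} - I$. A short sketch that evaluates both; the function name `nagao` and the example matrices are assumptions made for illustration:

```python
import numpy as np

def nagao(A_list, n_list):
    """Evaluate both forms of criterion (19); returns (form1, form2)."""
    n = sum(n_list)
    S = sum(A_list) / n                      # pooled estimate S = A / n
    S_inv = np.linalg.inv(S)
    p = S.shape[0]
    form1 = form2 = 0.0
    for A, ng in zip(A_list, n_list):
        Sg = A / ng
        M = Sg @ S_inv - np.eye(p)           # S_g S^{-1} - I
        form1 += 0.5 * ng * np.trace(M @ M)
        D = (Sg - S) @ S_inv                 # (S_g - S) S^{-1}
        form2 += 0.5 * ng * np.trace(D @ D)
    return form1, form2

A_list = [np.array([[4.0, 1.0], [1.0, 3.0]]), np.array([[9.0, -2.0], [-2.0, 5.0]])]
f1, f2 = nagao(A_list, [4, 6])

# When every S_g equals S the criterion vanishes.
z1, z2 = nagao([4 * np.eye(2), 6 * np.eye(2)], [4, 6])
```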


10.3. CRITERIA FOR TESTING THAT SEVERAL NORMAL 
DISTRIBUTIONS ARE IDENTICAL 


In Section 8.8 we considered testing the equality of mean vectors when we
assumed the covariance matrices were the same; that is, we tested

(1)    $H_2: \mu^{(1)} = \cdots = \mu^{(q)}$ given $\Sigma_1 = \Sigma_2 = \cdots = \Sigma_q.$

The test of the assumption in $H_2$ was considered in Section 10.2. Now let us
consider the hypothesis that both means and covariances are the same; this is
a combination of $H_1$ and $H_2$. We test

(2)    $H: \mu^{(1)} = \mu^{(2)} = \cdots = \mu^{(q)}, \qquad \Sigma_1 = \Sigma_2 = \cdots = \Sigma_q.$

As in Section 10.2, let $x_\alpha^{(g)}$, $\alpha = 1, \dots, N_g$, be an observation
from $N(\mu^{(g)}, \Sigma_g)$, $g = 1, \dots, q$. Then $\Omega$ is the unrestricted
parameter space of $\{\mu^{(g)}, \Sigma_g\}$, $g = 1, \dots, q$, where $\Sigma_g$ is
positive definite, and $\omega^*$ consists of the space restricted by (2).

The likelihood function is given by (3) of Section 10.2. The hypothesis $H_1$
of Section 10.2 is that the parameter point falls in $\omega$; the hypothesis $H_2$ of
Section 8.8 is that the parameter point falls in $\omega^*$ given it falls in
$\omega \supset \omega^*$; and the hypothesis $H$ here is that the parameter point
falls in $\omega^*$ given that it is in $\Omega$.

We use the following lemma: 


Lemma 10.3.1. Let $y$ be an observation vector on a random vector with
density $f(z, \theta)$, where $\theta$ is a parameter vector in a space $\Omega$. Let $H_a$ be the
hypothesis $\theta \in \Omega_a \subset \Omega$, let $H_b$ be the hypothesis
$\theta \in \Omega_b \subset \Omega_a$, given $\theta \in \Omega_a$, and let $H_{ab}$ be the
hypothesis $\theta \in \Omega_b$, given $\theta \in \Omega$. If $\lambda_a$, the likelihood ratio
criterion for testing $H_a$, $\lambda_b$ for $H_b$, and $\lambda_{ab}$ for $H_{ab}$ are uniquely
defined for the observation vector $y$, then

(3)    $\lambda_{ab} = \lambda_a \lambda_b.$

Proof. The lemma follows from the definitions:

(4)    $\lambda_a = \frac{\max_{\theta \in \Omega_a} f(y, \theta)}{\max_{\theta \in \Omega} f(y, \theta)},$

(5)    $\lambda_b = \frac{\max_{\theta \in \Omega_b} f(y, \theta)}{\max_{\theta \in \Omega_a} f(y, \theta)},$

(6)    $\lambda_{ab} = \frac{\max_{\theta \in \Omega_b} f(y, \theta)}{\max_{\theta \in \Omega} f(y, \theta)}.$


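Thus the likelihood ratio criterion for the hypothesis $H$ is the product of
the likelihood ratio criteria for $H_1$ and $H_2$,

(7)    $\lambda = \lambda_1 \lambda_2 = \frac{\prod_{g=1}^{q} |A_g|^{\frac{1}{2}N_g}}{|B|^{\frac{1}{2}N}} \cdot \frac{N^{\frac{1}{2}pN}}{\prod_{g=1}^{q} N_g^{\frac{1}{2}pN_g}},$

where

(8)    $B = \sum_{g=1}^{q} \sum_{\alpha=1}^{N_g} (x_\alpha^{(g)} - \bar{x})(x_\alpha^{(g)} - \bar{x})' = A + \sum_{g=1}^{q} N_g (\bar{x}^{(g)} - \bar{x})(\bar{x}^{(g)} - \bar{x})'.$

The critical region is defined by

(9)    $\lambda \le \lambda(\varepsilon),$

where $\lambda(\varepsilon)$ is chosen so that the probability of (9) under $H$ is $\varepsilon$.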
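
The second equality in (8) is the usual split of the total sum of squares and cross-products into a within-sample part $A$ and a between-sample part. A small check on simulated data; the sample sizes, the seed, and the variable names are illustrative assumptions, and $\bar{x}$ is taken to be the mean of all $N$ observations:

```python
import numpy as np

rng = np.random.default_rng(1)
p, sizes = 2, [5, 8, 6]

# One sample per population; the shift g only makes the sample means differ.
samples = [rng.standard_normal((N, p)) + g for g, N in enumerate(sizes)]

pooled = np.vstack(samples)
xbar = pooled.mean(axis=0)                       # overall mean of all N observations

# Within-sample matrix A = sum of the A_g of (2), Section 10.2.
A = sum((X - X.mean(axis=0)).T @ (X - X.mean(axis=0)) for X in samples)

# Between-sample term: sum over g of N_g (xbar_g - xbar)(xbar_g - xbar)'.
between = sum(len(X) * np.outer(X.mean(axis=0) - xbar, X.mean(axis=0) - xbar)
              for X in samples)

# Total matrix B computed directly from its definition.
B = (pooled - xbar).T @ (pooled - xbar)
```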

Let

(10)    $V_2 = \frac{|A|^{\frac{1}{2}n}}{|B|^{\frac{1}{2}n}} = \lambda_2^{n/N};$

this is equivalent to $\lambda_2$ for testing $H_2$, which is $\lambda$ of (12) of Section 8.8. We
might consider

(11)    $V = V_1 V_2 = \frac{\prod_{g=1}^{q} |A_g|^{\frac{1}{2}n_g}}{|B|^{\frac{1}{2}n}}.$

However, Perlman (1980) has shown that the likelihood ratio test is unbiased.

10.4. DISTRIBUTIONS OF THE CRITERIA

10.4.1. Characterization of the Distributions

First let us consider $V_1$ given by (10) of Section 10.2. If

(1)    $V_{1g} = \frac{|A_1 + \cdots + A_{g-1}|^{\frac{1}{2}(n_1 + \cdots + n_{g-1})} |A_g|^{\frac{1}{2}n_g}}{|A_1 + \cdots + A_g|^{\frac{1}{2}(n_1 + \cdots + n_g)}}, \qquad g = 2, \dots, q,$

then

(2)    $V_1 = \prod_{g=2}^{q} V_{1g}.$

Theorem 10.4.1. $V_{12}, \dots, V_{1q}$ defined by (1) are independent when
$\Sigma_1 = \cdots = \Sigma_q$ and $n_g \ge p$, $g = 1, \dots, q$.

The theorem is a consequence of the following lemma:

Lemma 10.4.1. If $A$ and $B$ are independently distributed according to
$W(\Sigma, m)$ and $W(\Sigma, n)$, respectively, $n \ge p$, $m \ge p$, and $C$ is such that
$C(A + B)C' = I$, then $A + B$ and $CAC'$ are independently distributed; $A + B$ has the
Wishart distribution with $m + n$ degrees of freedom, and $CAC'$ has the multivariate
beta distribution with $n$ and $m$ degrees of freedom.

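Proof of Lemma. The density of $D = A + B$ and $E = CAC'$ is found by
replacing $A$ and $B$ in their joint density by $C^{-1} E C'^{-1}$ and
$D - C^{-1} E C'^{-1} = C^{-1}(I - E)C'^{-1}$, respectively, and multiplying by the
Jacobian, which is $\operatorname{mod}|C|^{-(p+1)} = |D|^{\frac{1}{2}(p+1)}$, to obtain

(3)    $\frac{|C^{-1} E C'^{-1}|^{\frac{1}{2}(m-p-1)} |C^{-1}(I - E)C'^{-1}|^{\frac{1}{2}(n-p-1)} e^{-\frac{1}{2}\operatorname{tr}\Sigma^{-1}D} |D|^{\frac{1}{2}(p+1)}}{2^{\frac{1}{2}p(m+n)} |\Sigma|^{\frac{1}{2}(m+n)} \Gamma_p(\frac{1}{2}m) \Gamma_p(\frac{1}{2}n)}$

       $= \frac{|D|^{\frac{1}{2}(m+n-p-1)} e^{-\frac{1}{2}\operatorname{tr}\Sigma^{-1}D}}{2^{\frac{1}{2}p(m+n)} |\Sigma|^{\frac{1}{2}(m+n)} \Gamma_p[\frac{1}{2}(m+n)]} \cdot \frac{\Gamma_p[\frac{1}{2}(m+n)]}{\Gamma_p(\frac{1}{2}m) \Gamma_p(\frac{1}{2}n)} |E|^{\frac{1}{2}(m-p-1)} |I - E|^{\frac{1}{2}(n-p-1)}$

for $D$, $E$, and $I - E$ positive definite. ■

Proof of Theorem. If we let $A_1 + \cdots + A_g = D_g$ and
$C_g(A_1 + \cdots + A_{g-1})C_g' = E_g$, where $C_g D_g C_g' = I$, $g = 2, \dots, q$, then

(4)    $V_{1g} = \frac{|C_g^{-1} E_g C_g'^{-1}|^{\frac{1}{2}(n_1 + \cdots + n_{g-1})} |C_g^{-1}(I - E_g)C_g'^{-1}|^{\frac{1}{2}n_g}}{|C_g^{-1} C_g'^{-1}|^{\frac{1}{2}(n_1 + \cdots + n_g)}} = |E_g|^{\frac{1}{2}(n_1 + \cdots + n_{g-1})} |I - E_g|^{\frac{1}{2}n_g}, \qquad g = 2, \dots, q,$

and $E_2, \dots, E_q$ are independent by Lemma 10.4.1. ■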


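We shall now find a characterization of the distribution of $V_{1g}$. A statistic
$V_{1g}$ is of the form

(5)    $\frac{|B|^b |C|^c}{|B + C|^{b+c}}.$

Let $B_i$ and $C_i$ be the upper left-hand square submatrices of $B$ and $C$,
respectively, of order $i$. Define $b_{(i)}$ and $c_{(i)}$ by

(6)    $B_i = \begin{pmatrix} B_{i-1} & b_{(i)} \\ b_{(i)}' & b_{ii} \end{pmatrix}, \qquad C_i = \begin{pmatrix} C_{i-1} & c_{(i)} \\ c_{(i)}' & c_{ii} \end{pmatrix}, \qquad i = 2, \dots, p.$

Then (5) is ($B_0 = C_0 = I$, $b_{(1)} = c_{(1)} = 0$)

(7)    $\frac{|B|^b |C|^c}{|B + C|^{b+c}} = \prod_{i=1}^{p} \frac{|B_i|^b |C_i|^c}{|B_i + C_i|^{b+c}} \cdot \frac{|B_{i-1} + C_{i-1}|^{b+c}}{|B_{i-1}|^b |C_{i-1}|^c}$

       $= \prod_{i=1}^{p} \frac{(b_{ii} - b_{(i)}' B_{i-1}^{-1} b_{(i)})^b (c_{ii} - c_{(i)}' C_{i-1}^{-1} c_{(i)})^c}{[b_{ii} + c_{ii} - (b_{(i)} + c_{(i)})' (B_{i-1} + C_{i-1})^{-1} (b_{(i)} + c_{(i)})]^{b+c}}$

       $= \prod_{i=1}^{p} \left\{ \frac{b_{ii \cdot i-1}^b \, c_{ii \cdot i-1}^c}{(b_{ii \cdot i-1} + c_{ii \cdot i-1})^{b+c}} \cdot \frac{(b_{ii \cdot i-1} + c_{ii \cdot i-1})^{b+c}}{[b_{ii \cdot i-1} + c_{ii \cdot i-1} + b_{(i)}' B_{i-1}^{-1} b_{(i)} + c_{(i)}' C_{i-1}^{-1} c_{(i)} - (b_{(i)} + c_{(i)})' (B_{i-1} + C_{i-1})^{-1} (b_{(i)} + c_{(i)})]^{b+c}} \right\},$

where $b_{ii \cdot i-1} = b_{ii} - b_{(i)}' B_{i-1}^{-1} b_{(i)}$ and
$c_{ii \cdot i-1} = c_{ii} - c_{(i)}' C_{i-1}^{-1} c_{(i)}$. The second term
for $i = 1$ is defined as 1.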

Now we want to argue that the ratios on the right-hand side of (7) are
statistically independent when $B$ and $C$ are independently distributed
according to $W(\Sigma, m)$ and $W(\Sigma, n)$, respectively. It follows from Theorem 4.3.3
that for $B_{i-1}$ fixed $b_{(i)}$ and $b_{ii \cdot i-1}$ are independently distributed according to
$N(B_{i-1} \beta_{(i)}, \sigma_{ii \cdot i-1} B_{i-1})$ and $\sigma_{ii \cdot i-1} \chi^2$ with $m - (i - 1)$ degrees of freedom,
respectively. Lemma 10.4.1 implies that the first term (which is a function of
$b_{ii \cdot i-1} / c_{ii \cdot i-1}$) is independent of $b_{ii \cdot i-1} + c_{ii \cdot i-1}$.

We apply the following lemma: 

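Lemma 10.4.2. For $B_{i-1}$ and $C_{i-1}$ positive definite

(8)    $b_{(i)}' B_{i-1}^{-1} b_{(i)} + c_{(i)}' C_{i-1}^{-1} c_{(i)} - (b_{(i)} + c_{(i)})' (B_{i-1} + C_{i-1})^{-1} (b_{(i)} + c_{(i)})$

       $= (B_{i-1}^{-1} b_{(i)} - C_{i-1}^{-1} c_{(i)})' (B_{i-1}^{-1} + C_{i-1}^{-1})^{-1} (B_{i-1}^{-1} b_{(i)} - C_{i-1}^{-1} c_{(i)}).$

Proof. Use of $(B^{-1} + C^{-1})^{-1} = [C^{-1}(B + C)B^{-1}]^{-1} = B(B + C)^{-1}C$
shows the left-hand side of (8) is (omitting $i$ and $i - 1$)

(9)    $b' B^{-1} (B^{-1} + C^{-1})^{-1} (B^{-1} + C^{-1}) b + c' (B^{-1} + C^{-1}) (B^{-1} + C^{-1})^{-1} C^{-1} c - (b + c)' B^{-1} (B^{-1} + C^{-1})^{-1} C^{-1} (b + c)$

       $= b' B^{-1} (B^{-1} + C^{-1})^{-1} B^{-1} b + c' C^{-1} (B^{-1} + C^{-1})^{-1} C^{-1} c - b' B^{-1} (B^{-1} + C^{-1})^{-1} C^{-1} c - c' C^{-1} (B^{-1} + C^{-1})^{-1} B^{-1} b,$

which is the right-hand side of (8). ■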
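
The identity (8) is purely algebraic, so it can be confirmed on arbitrary positive definite matrices. In this sketch the dimension, the seed, and the variable names are illustrative assumptions; `B` and `C` play the roles of $B_{i-1}$ and $C_{i-1}$, and `b`, `c` those of $b_{(i)}$, $c_{(i)}$:

```python
import numpy as np

rng = np.random.default_rng(2)
k = 3                                            # k stands for i - 1

ZB = rng.standard_normal((8, k))
B = ZB.T @ ZB                                    # positive definite
ZC = rng.standard_normal((9, k))
C = ZC.T @ ZC                                    # positive definite
b = rng.standard_normal(k)
c = rng.standard_normal(k)

Binv, Cinv = np.linalg.inv(B), np.linalg.inv(C)

# Left-hand side of (8).
lhs = b @ Binv @ b + c @ Cinv @ c - (b + c) @ np.linalg.inv(B + C) @ (b + c)

# Right-hand side of (8): quadratic form in B^{-1} b - C^{-1} c.
d = Binv @ b - Cinv @ c
rhs = d @ np.linalg.inv(Binv + Cinv) @ d
```

Because the right-hand side is a quadratic form in a positive definite matrix, the check also makes visible that (8) is nonnegative.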

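The denominator of the $i$th second term in (7) is the numerator plus (8).
The conditional distribution of $B_{i-1}^{-1} b_{(i)} - C_{i-1}^{-1} c_{(i)}$ is normal with mean
$B_{i-1}^{-1} B_{i-1} \beta_{(i)} - C_{i-1}^{-1} C_{i-1} \beta_{(i)} = 0$ and covariance matrix
$\sigma_{ii \cdot i-1} (B_{i-1}^{-1} + C_{i-1}^{-1})$. The covariance matrix is $\sigma_{ii \cdot i-1}$ times the
inverse of the second matrix on the right-hand side of (8). Thus (8) is distributed as
$\sigma_{ii \cdot i-1} \chi^2$ with $i - 1$ degrees of freedom, independent of $B_{i-1}$, $C_{i-1}$,
$b_{ii \cdot i-1}$, and $c_{ii \cdot i-1}$.

Then

(10)    $\frac{b_{ii \cdot i-1}^b \, c_{ii \cdot i-1}^c}{(b_{ii \cdot i-1} + c_{ii \cdot i-1})^{b+c}}$

is distributed as $X_i^b (1 - X_i)^c$, where $X_i$ has the
$\beta[\frac{1}{2}(m - i + 1), \frac{1}{2}(n - i + 1)]$ distribution, $i = 1, \dots, p$. Also

(11)    $\left[ \frac{b_{ii \cdot i-1} + c_{ii \cdot i-1}}{b_{ii \cdot i-1} + c_{ii \cdot i-1} + (8)} \right]^{b+c}, \qquad i = 2, \dots, p,$

is distributed as $Y_i^{b+c}$, where $Y_i$ has the
$\beta[\frac{1}{2}(m + n) - i + 1, \frac{1}{2}(i - 1)]$ distribution. Then (5) is distributed as
$\prod_{i=1}^{p} X_i^b (1 - X_i)^c \prod_{i=2}^{p} Y_i^{b+c}$, and the factors
are mutually independent.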

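Theorem 10.4.2.

(12)    $V_1 = \prod_{g=2}^{q} \left\{ \prod_{i=1}^{p} X_{ig}^{\frac{1}{2}(n_1 + \cdots + n_{g-1})} (1 - X_{ig})^{\frac{1}{2}n_g} \prod_{i=2}^{p} Y_{ig}^{\frac{1}{2}(n_1 + \cdots + n_g)} \right\},$

where the $X$'s and $Y$'s are independent, $X_{ig}$ has the
$\beta[\frac{1}{2}(n_1 + \cdots + n_{g-1} - i + 1), \frac{1}{2}(n_g - i + 1)]$ distribution, and $Y_{ig}$ has the
$\beta[\frac{1}{2}(n_1 + \cdots + n_g) - i + 1, \frac{1}{2}(i - 1)]$ distribution.

Proof. The factors $V_{12}, \dots, V_{1q}$ are independent by Theorem 10.4.1. Each
term $V_{1g}$ is decomposed according to (7), and the factors are independent. ■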


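The factors of $V_1$ can be interpreted as test criteria for subhypotheses.
The term depending on $X_{i2}$ is the criterion for testing the hypothesis that
$\sigma_{ii \cdot i-1}^{(1)} = \sigma_{ii \cdot i-1}^{(2)}$, and the term depending on $Y_{i2}$ is the criterion for testing
$\sigma_{(i)}^{(1)} = \sigma_{(i)}^{(2)}$ given $\sigma_{ii \cdot i-1}^{(1)} = \sigma_{ii \cdot i-1}^{(2)}$ and
$\Sigma_{i-1}^{(1)} = \Sigma_{i-1}^{(2)}$. The terms depending on $X_{ig}$ and $Y_{ig}$ similarly
furnish criteria for testing $\Sigma_1 = \Sigma_g$ given $\Sigma_1 = \cdots = \Sigma_{g-1}$.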

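Now consider the likelihood ratio criterion $\lambda$ given by (7) of Section 10.3
for testing the hypothesis $\mu^{(1)} = \cdots = \mu^{(q)}$ and $\Sigma_1 = \cdots = \Sigma_q$. It is equivalent
to the criterion

(13)    $W = \frac{\prod_{g=1}^{q} |A_g|^{\frac{1}{2}N_g}}{|A_1 + \cdots + A_q + \sum_{g=1}^{q} N_g (\bar{x}^{(g)} - \bar{x})(\bar{x}^{(g)} - \bar{x})'|^{\frac{1}{2}N}}$

        $= \frac{\prod_{g=1}^{q} |A_g|^{\frac{1}{2}N_g}}{|A_1 + \cdots + A_q|^{\frac{1}{2}N}} \cdot \frac{|A_1 + \cdots + A_q|^{\frac{1}{2}N}}{|A_1 + \cdots + A_q + \sum_{g=1}^{q} N_g (\bar{x}^{(g)} - \bar{x})(\bar{x}^{(g)} - \bar{x})'|^{\frac{1}{2}N}}.$

The two factors of (13) are independent because the first factor is independent
of $A_1 + \cdots + A_q$ (by Lemma 10.4.1 and the proof of Theorem 10.4.1)
and of $\bar{x}^{(1)}, \dots, \bar{x}^{(q)}$.

Theorem 10.4.3.

(14)    $W = \prod_{g=2}^{q} \left\{ \prod_{i=1}^{p} X_{ig}^{\frac{1}{2}(N_1 + \cdots + N_{g-1})} (1 - X_{ig})^{\frac{1}{2}N_g} \prod_{i=2}^{p} Y_{ig}^{\frac{1}{2}(N_1 + \cdots + N_g)} \right\} \prod_{i=1}^{p} Z_i^{\frac{1}{2}N},$

where the $X$'s, $Y$'s, and $Z$'s are independent, $X_{ig}$ has the
$\beta[\frac{1}{2}(n_1 + \cdots + n_{g-1} - i + 1), \frac{1}{2}(n_g - i + 1)]$ distribution, $Y_{ig}$ has the
$\beta[\frac{1}{2}(n_1 + \cdots + n_g) - i + 1, \frac{1}{2}(i - 1)]$ distribution, and $Z_i$ has the
$\beta[\frac{1}{2}(n + 1 - i), \frac{1}{2}(q - 1)]$ distribution.

Proof. The characterization of the first factor in (13) corresponds to that
of $V_1$ with the exponents of $X_{ig}$ and $1 - X_{ig}$ modified by replacing $n_g$ by $N_g$.
The second term is $U_{p,q-1,n}^{\frac{1}{2}N}$, and its characterization follows from Theorem
8.4.1. ■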

10.4.2. Moments of the Distributions 

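We now find the moments of $V_1$ and of $W$. Since $0 \le V_1 \le 1$ and $0 \le W \le 1$,
the moments determine the distributions uniquely. The $h$th moment of $V_1$
we find from the characterization of the distribution in Theorem 10.4.2:

(15)    $\mathcal{E} V_1^h = \prod_{g=2}^{q} \left\{ \prod_{i=1}^{p} \mathcal{E} X_{ig}^{\frac{1}{2}(n_1 + \cdots + n_{g-1})h} (1 - X_{ig})^{\frac{1}{2}n_g h} \prod_{i=2}^{p} \mathcal{E} Y_{ig}^{\frac{1}{2}(n_1 + \cdots + n_g)h} \right\}$

        $= \prod_{g=2}^{q} \left\{ \prod_{i=1}^{p} \frac{\Gamma[\frac{1}{2}(n_1 + \cdots + n_{g-1})(1 + h) - \frac{1}{2}(i - 1)] \, \Gamma[\frac{1}{2}n_g(1 + h) - \frac{1}{2}(i - 1)] \, \Gamma[\frac{1}{2}(n_1 + \cdots + n_g) - i + 1]}{\Gamma[\frac{1}{2}(n_1 + \cdots + n_{g-1} - i + 1)] \, \Gamma[\frac{1}{2}(n_g - i + 1)] \, \Gamma[\frac{1}{2}(n_1 + \cdots + n_g)(1 + h) - i + 1]} \right.$

        $\left. \cdot \prod_{i=2}^{p} \frac{\Gamma[\frac{1}{2}(n_1 + \cdots + n_g)(1 + h) - i + 1] \, \Gamma[\frac{1}{2}(n_1 + \cdots + n_g - i + 1)]}{\Gamma[\frac{1}{2}(n_1 + \cdots + n_g) - i + 1] \, \Gamma[\frac{1}{2}(n_1 + \cdots + n_g)(1 + h) - \frac{1}{2}(i - 1)]} \right\}$

        $= \prod_{i=1}^{p} \left\{ \frac{\Gamma[\frac{1}{2}(n + 1 - i)]}{\Gamma[\frac{1}{2}(n + hn + 1 - i)]} \prod_{g=1}^{q} \frac{\Gamma[\frac{1}{2}(n_g + hn_g + 1 - i)]}{\Gamma[\frac{1}{2}(n_g + 1 - i)]} \right\}$

        $= \frac{\Gamma_p(\frac{1}{2}n)}{\Gamma_p[\frac{1}{2}(n + hn)]} \prod_{g=1}^{q} \frac{\Gamma_p[\frac{1}{2}(n_g + hn_g)]}{\Gamma_p(\frac{1}{2}n_g)}.$

The $h$th moment of $W$ can be found from its representation in Theorem
10.4.3. We have

(16)    $\mathcal{E} W^h = \prod_{g=2}^{q} \left\{ \prod_{i=1}^{p} \mathcal{E} X_{ig}^{\frac{1}{2}(N_1 + \cdots + N_{g-1})h} (1 - X_{ig})^{\frac{1}{2}N_g h} \prod_{i=2}^{p} \mathcal{E} Y_{ig}^{\frac{1}{2}(N_1 + \cdots + N_g)h} \right\} \prod_{i=1}^{p} \mathcal{E} Z_i^{\frac{1}{2}Nh}$

        $= \prod_{g=2}^{q} \left\{ \prod_{i=1}^{p} \frac{\Gamma[\frac{1}{2}(n_1 + \cdots + n_{g-1} + 1 - i) + \frac{1}{2}h(N_1 + \cdots + N_{g-1})] \, \Gamma[\frac{1}{2}(n_g + 1 - i) + \frac{1}{2}hN_g] \, \Gamma[\frac{1}{2}(n_1 + \cdots + n_g) - i + 1]}{\Gamma[\frac{1}{2}(n_1 + \cdots + n_{g-1} + 1 - i)] \, \Gamma[\frac{1}{2}(n_g + 1 - i)] \, \Gamma[\frac{1}{2}(n_1 + \cdots + n_g) + \frac{1}{2}h(N_1 + \cdots + N_g) + 1 - i]} \right.$

        $\left. \cdot \prod_{i=2}^{p} \frac{\Gamma[\frac{1}{2}(n_1 + \cdots + n_g) + \frac{1}{2}h(N_1 + \cdots + N_g) + 1 - i] \, \Gamma[\frac{1}{2}(n_1 + \cdots + n_g + 1 - i)]}{\Gamma[\frac{1}{2}(n_1 + \cdots + n_g) + 1 - i] \, \Gamma[\frac{1}{2}(n_1 + \cdots + n_g + 1 - i) + \frac{1}{2}h(N_1 + \cdots + N_g)]} \right\}$

        $\cdot \prod_{i=1}^{p} \frac{\Gamma[\frac{1}{2}(n + 1 - i + hN)] \, \Gamma[\frac{1}{2}(N - i)]}{\Gamma[\frac{1}{2}(n + 1 - i)] \, \Gamma[\frac{1}{2}(N + hN - i)]}$

        $= \prod_{i=1}^{p} \left\{ \prod_{g=1}^{q} \frac{\Gamma[\frac{1}{2}(n_g + hN_g + 1 - i)]}{\Gamma[\frac{1}{2}(n_g + 1 - i)]} \cdot \frac{\Gamma[\frac{1}{2}(N - i)]}{\Gamma[\frac{1}{2}(N + hN - i)]} \right\}$

        $= \frac{\Gamma_p[\frac{1}{2}(N - 1)]}{\Gamma_p[\frac{1}{2}(N + hN - 1)]} \prod_{g=1}^{q} \frac{\Gamma_p[\frac{1}{2}(n_g + hN_g)]}{\Gamma_p(\frac{1}{2}n_g)}.$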

We summarize in the following theorem: 

Theorem 10.4.4. Let $V_1$ be the criterion defined by (10) of Section 10.2 for
testing the hypothesis that $H_1: \Sigma_1 = \cdots = \Sigma_q$, where $A_g$ is $n_g$ times the sample
covariance matrix and $n_g + 1$ is the size of the sample from the $g$th population; let
$W$ be the criterion defined by (13) for testing the hypothesis $H: \mu^{(1)} = \cdots = \mu^{(q)}$
and $H_1$, where $B = A + \sum_g N_g (\bar{x}^{(g)} - \bar{x})(\bar{x}^{(g)} - \bar{x})'$. The $h$th moment of $V_1$ when
$H_1$ is true is given by (15). The $h$th moment of $W$, the criterion for testing $H$, is
given by (16).

This theorem was first proved by Wilks (1932). See Problem 10.5 for an
alternative approach.

If $p$ is even, say $p = 2r$, we can use the duplication formula for the gamma
function [$\Gamma(\alpha + \frac{1}{2}) \Gamma(\alpha + 1) = \sqrt{\pi} \, \Gamma(2\alpha + 1) 2^{-2\alpha}$]. Then

(17)    $\mathcal{E} V_1^h = \prod_{j=1}^{r} \left\{ \left[ \prod_{g=1}^{q} \frac{\Gamma(n_g + hn_g + 1 - 2j)}{\Gamma(n_g + 1 - 2j)} \right] \frac{\Gamma(n + 1 - 2j)}{\Gamma(n + hn + 1 - 2j)} \right\}$

and

(18)    $\mathcal{E} W^h = \prod_{j=1}^{r} \left\{ \left[ \prod_{g=1}^{q} \frac{\Gamma(n_g + hN_g + 1 - 2j)}{\Gamma(n_g + 1 - 2j)} \right] \frac{\Gamma(N - 2j)}{\Gamma(N + hN - 2j)} \right\}.$

In principle the distributions of the factors can be integrated to obtain the
distributions of $V_1$ and $W$. In Section 10.6 we consider $V_1$ when $p = 2$, $q = 2$
(the case of $p = 1$, $q = 2$ being a function of an $F$-statistic). In other cases,
the integrals become unmanageable. To find probabilities we use the asymptotic
expansion given in the next section. Box (1949) has given some other
approximate distributions.
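
For even $p$ the multivariate-gamma form (15) and the reduced form (17) should give the same moment, because the powers of 2 from the duplication formula cancel when $\sum_g n_g = n$. The sketch below compares the two on the log scale using only the standard library; the function names and the chosen values of $h$, $p$, and $n_g$ are illustrative assumptions.

```python
from math import lgamma, log, pi

def log_gamma_p(a, p):
    """log of the multivariate gamma function Gamma_p(a)."""
    return 0.25 * p * (p - 1) * log(pi) + sum(lgamma(a - 0.5 * i) for i in range(p))

def log_moment_V1(h, n_list, p):
    """log E V_1^h from the last line of (15)."""
    n = sum(n_list)
    out = log_gamma_p(0.5 * n, p) - log_gamma_p(0.5 * n * (1 + h), p)
    for ng in n_list:
        out += log_gamma_p(0.5 * ng * (1 + h), p) - log_gamma_p(0.5 * ng, p)
    return out

def log_moment_V1_even_p(h, n_list, p):
    """log E V_1^h from (17); valid only for even p = 2r."""
    n = sum(n_list)
    out = 0.0
    for j in range(1, p // 2 + 1):
        out += lgamma(n + 1 - 2 * j) - lgamma(n + h * n + 1 - 2 * j)
        for ng in n_list:
            out += lgamma(ng + h * ng + 1 - 2 * j) - lgamma(ng + 1 - 2 * j)
    return out

general = log_moment_V1(0.5, [11, 11, 15], 4)
even_p = log_moment_V1_even_p(0.5, [11, 11, 15], 4)   # should agree with `general`
```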


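10.4.3. Step-down Tests

The characterizations of the distributions of the criteria in terms of
independent factors suggest testing the hypotheses $H_1$ and $H$ by testing component
hypotheses sequentially. First, we consider testing $H_1: \Sigma_1 = \Sigma_2$ for $q = 2$.
Let

(19)    $X_{(i)}^{(g)} = \begin{pmatrix} X_{(i-1)}^{(g)} \\ X_i^{(g)} \end{pmatrix}, \qquad \mu_{(i)}^{(g)} = \begin{pmatrix} \mu_{(i-1)}^{(g)} \\ \mu_i^{(g)} \end{pmatrix}, \qquad \Sigma_i^{(g)} = \begin{pmatrix} \Sigma_{i-1}^{(g)} & \sigma_{(i)}^{(g)} \\ \sigma_{(i)}^{(g)\prime} & \sigma_{ii}^{(g)} \end{pmatrix}, \qquad g = 1, 2.$

The conditional distribution of $X_i^{(g)}$ given $X_{(i-1)}^{(g)} = x_{(i-1)}^{(g)}$ is

(20)    $N[\mu_i^{(g)} + \sigma_{(i)}^{(g)\prime} \Sigma_{i-1}^{(g)-1} (x_{(i-1)}^{(g)} - \mu_{(i-1)}^{(g)}), \; \sigma_{ii \cdot i-1}^{(g)}],$

where $\sigma_{ii \cdot i-1}^{(g)} = \sigma_{ii}^{(g)} - \sigma_{(i)}^{(g)\prime} \Sigma_{i-1}^{(g)-1} \sigma_{(i)}^{(g)}$. It is assumed that the components of $X$
have been numbered in descending order of importance. At the $i$th step the
component hypothesis $\sigma_{ii \cdot i-1}^{(1)} = \sigma_{ii \cdot i-1}^{(2)}$ is tested at significance level $\varepsilon_i$ by
means of an $F$-test based on $s_{ii \cdot i-1}^{(1)} / s_{ii \cdot i-1}^{(2)}$; $S_1$ and $S_2$ are partitioned like $\Sigma^{(1)}$
and $\Sigma^{(2)}$. If that hypothesis is accepted, then the hypothesis $\sigma_{(i)}^{(1)} = \sigma_{(i)}^{(2)}$ (or
$\Sigma_{i-1}^{(1)-1} \sigma_{(i)}^{(1)} = \Sigma_{i-1}^{(2)-1} \sigma_{(i)}^{(2)}$) is tested at significance level $\delta_i$ on the assumption
that $\Sigma_{i-1}^{(1)} = \Sigma_{i-1}^{(2)}$ (a hypothesis previously accepted). The criterion is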


(21) 


(把 r 1 沿-奶 r 1 墙 )，㈤ V + grWo) 

(j— 
where $(n_1 + n_2 - 2i + 2) s_{ii \cdot i-1} = (n_1 - i + 1) s_{ii \cdot i-1}^{(1)} + (n_2 - i + 1) s_{ii \cdot i-1}^{(2)}$. Under
the null hypothesis (21) has the $F$-distribution with $i - 1$ and $n_1 + n_2 - 2i + 2$
degrees of freedom. If this hypothesis is accepted, the $(i + 1)$st step is taken.
The overall hypothesis $\Sigma_1 = \Sigma_2$ is accepted if the $2p - 1$ component
hypotheses are accepted. (At the first step, $\sigma_{(1)}^{(g)}$ is vacuous.) The overall
significance level is

(22)    $1 - \prod_{i=1}^{p} (1 - \varepsilon_i) \prod_{i=2}^{p} (1 - \delta_i).$

If any component null hypothesis is rejected, the overall hypothesis is
rejected.

If $q > 2$, the null hypothesis $H_1: \Sigma_1 = \cdots = \Sigma_q$ is broken down into a
sequence of hypotheses $[1/(g - 1)](\Sigma_1 + \cdots + \Sigma_{g-1}) = \Sigma_g$ and tested
sequentially. Each such matrix hypothesis is tested as $\Sigma_1 = \Sigma_2$ with $S_2$ replaced by
$S_g$ and $S_1$ replaced by $[1/(n_1 + \cdots + n_{g-1})](A_1 + \cdots + A_{g-1})$.

In the case of the hypothesis $H$, consider first $q = 2$, $\Sigma_1 = \Sigma_2$, and
$\mu^{(1)} = \mu^{(2)}$. One can test $\Sigma_1 = \Sigma_2$. The steps for testing $\mu^{(1)} = \mu^{(2)}$ consist of
$t$-tests for $\mu_i^{(1)} = \mu_i^{(2)}$ based on the conditional distribution of $X_i^{(1)}$ and $X_i^{(2)}$
given $x_{(i-1)}^{(1)}$ and $x_{(i-1)}^{(2)}$. Alternatively one can test in sequence the equality of
the conditional distributions of $X_i^{(1)}$ and $X_i^{(2)}$ given $x_{(i-1)}^{(1)}$ and $x_{(i-1)}^{(2)}$.

For $q > 2$, the hypothesis $\Sigma_1 = \cdots = \Sigma_q$ can be tested, and then
$\mu^{(1)} = \cdots = \mu^{(q)}$. Alternatively, one can test $[1/(g - 1)](\Sigma_1 + \cdots + \Sigma_{g-1}) = \Sigma_g$ and
$[1/(g - 1)](\mu^{(1)} + \cdots + \mu^{(g-1)}) = \mu^{(g)}$.


10.5. ASYMPTOTIC EXPANSIONS OF THE DISTRIBUTIONS
OF THE CRITERIA


Again we make use of Theorem 8.5.1 to obtain asymptotic expansions of the
distributions of $V_1$ and of $\lambda$. We assume that $n_g = k_g n$, where $\sum_{g=1}^{q} k_g = 1$.
The asymptotic expansion is in terms of $n$ increasing with $k_1, \dots, k_q$ fixed.
(We could assume only $\lim n_g / n = k_g > 0$.)

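The $h$th moment of

(1)    $\lambda_1^* = V_1 \cdot \frac{n^{\frac{1}{2}pn}}{\prod_{g=1}^{q} n_g^{\frac{1}{2}pn_g}} = \left[ \prod_{g=1}^{q} \left( \frac{1}{k_g} \right)^{k_g} \right]^{\frac{1}{2}pn} V_1$

is

(2)    $\mathcal{E} \lambda_1^{*h} = K \left( \frac{n^{\frac{1}{2}pn}}{\prod_{g=1}^{q} n_g^{\frac{1}{2}pn_g}} \right)^{h} \frac{\prod_{g=1}^{q} \prod_{i=1}^{p} \Gamma[\frac{1}{2}n_g(1 + h) + \frac{1}{2}(1 - i)]}{\prod_{j=1}^{p} \Gamma[\frac{1}{2}n(1 + h) + \frac{1}{2}(1 - j)]},$

where $K$ does not depend on $h$. This is of the form of (1) of Section 8.5 with

(3)    $b = p, \quad y_j = \frac{1}{2}n, \quad \eta_j = \frac{1}{2}(1 - j), \quad j = 1, \dots, p;$
       $a = pq, \quad x_k = \frac{1}{2}n_g, \quad k = (g - 1)p + 1, \dots, gp, \quad g = 1, \dots, q;$
       $\xi_k = \frac{1}{2}(1 - i), \quad k = i, p + i, \dots, (q - 1)p + i, \quad i = 1, \dots, p.$

Then

(4)    $f = -2\left[ \sum \xi_k - \sum \eta_j - \frac{1}{2}(a - b) \right]$
       $= -\left[ q \sum_{i=1}^{p} (1 - i) - \sum_{j=1}^{p} (1 - j) - (qp - p) \right]$
       $= -\left[ -q \cdot \frac{1}{2}p(p - 1) + \frac{1}{2}p(p - 1) - (q - 1)p \right]$
       $= \frac{1}{2}(q - 1)p(p + 1),$

$\varepsilon_j = \frac{1}{2}(1 - \rho)n$, $j = 1, \dots, p$, and $\beta_k = \frac{1}{2}(1 - \rho)n_g = \frac{1}{2}(1 - \rho)k_g n$,
$k = (g - 1)p + 1, \dots, gp$.

In order to make the second term in the expansion vanish, we take $\rho$ as

(5)    $\rho = 1 - \left( \sum_{g=1}^{q} \frac{1}{n_g} - \frac{1}{n} \right) \frac{2p^2 + 3p - 1}{6(p + 1)(q - 1)}.$

Then

(6)    $\omega_2 = \frac{p(p + 1) \left[ (p - 1)(p + 2) \left( \sum_{g=1}^{q} \frac{1}{n_g^2} - \frac{1}{n^2} \right) - 6(q - 1)(1 - \rho)^2 \right]}{48 \rho^2}.$

Thus

(7)    $\Pr\{-2\rho \log \lambda_1^* \le z\} = \Pr\{\chi_f^2 \le z\} + \omega_2 \left[ \Pr\{\chi_{f+4}^2 \le z\} - \Pr\{\chi_f^2 \le z\} \right] + O(n^{-3}).$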

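Let $\lambda = W N^{\frac{1}{2}pN} / \prod_{g=1}^{q} N_g^{\frac{1}{2}pN_g}$. The $h$th moment is

(8)    $\mathcal{E} \lambda^{h} = K \left( \frac{N^{\frac{1}{2}pN}}{\prod_{g=1}^{q} N_g^{\frac{1}{2}pN_g}} \right)^{h} \frac{\prod_{g=1}^{q} \prod_{i=1}^{p} \Gamma[\frac{1}{2}N_g(1 + h) - \frac{1}{2}i]}{\prod_{j=1}^{p} \Gamma[\frac{1}{2}N(1 + h) - \frac{1}{2}j]}.$

This is the form (1) of Section 8.5 with

(9)    $b = p, \quad y_j = \frac{1}{2}N, \quad \eta_j = -\frac{1}{2}j, \quad j = 1, \dots, p;$
       $a = pq, \quad x_k = \frac{1}{2}N_g, \quad k = (g - 1)p + 1, \dots, gp, \quad g = 1, \dots, q;$
       $\xi_k = -\frac{1}{2}i, \quad k = i, p + i, \dots, (q - 1)p + i, \quad i = 1, \dots, p.$

The basic number of degrees of freedom is $f = \frac{1}{2}p(p + 3)(q - 1)$. We use (11)
of Section 8.5 with $\beta_k = (1 - \rho)x_k$ and $\varepsilon_j = (1 - \rho)y_j$. To make $\omega_1 = 0$, we
take

(10)    $\rho = 1 - \left( \sum_{g=1}^{q} \frac{1}{N_g} - \frac{1}{N} \right) \frac{2p^2 + 9p + 11}{6(q - 1)(p + 3)}.$

Then

(11)    $\omega_2 = \frac{p(p + 3)}{48 \rho^2} \left[ \left( \sum_{g=1}^{q} \frac{1}{N_g^2} - \frac{1}{N^2} \right)(p + 1)(p + 2) - 6(1 - \rho)^2 (q - 1) \right].$

The asymptotic expansion of the distribution of $-2\rho \log \lambda$ is

(12)    $\Pr\{-2\rho \log \lambda \le z\} = \Pr\{\chi_f^2 \le z\} + \omega_2 \left[ \Pr\{\chi_{f+4}^2 \le z\} - \Pr\{\chi_f^2 \le z\} \right] + O(n^{-3}).$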

Box (1949) considered the case of $\lambda_1^*$ in considerable detail. In addition to
this expansion he considered the use of (13) of Section 8.6. He also gave an
$F$-approximation.

As an example, we use one given by E. S. Pearson and Wilks (1933). The
measurements are made on tensile strength ($X_1$) and hardness ($X_2$) of
aluminum die castings. There are 12 observations in each of five samples.
The observed sums of squares and cross-products in the five samples are

(13)    $A_1 = \begin{pmatrix} 78.948 & 214.18 \\ 214.18 & 1247.18 \end{pmatrix}, \qquad A_2 = \begin{pmatrix} 223.695 & 657.62 \\ 657.62 & 2519.31 \end{pmatrix},$

        $A_3 = \begin{pmatrix} 57.448 & 190.63 \\ 190.63 & 1241.78 \end{pmatrix}, \qquad A_4 = \begin{pmatrix} 187.618 & 375.91 \\ 375.91 & 1473.44 \end{pmatrix},$

        $A_5 = \begin{pmatrix} 88.456 & 259.18 \\ 259.18 & 1171.73 \end{pmatrix},$

and the sum of these is

(14)    $A = \begin{pmatrix} 636.165 & 1697.52 \\ 1697.52 & 7653.44 \end{pmatrix}.$

The $-\log \lambda_1^*$ is 5.399. To use the asymptotic expansion we find $\rho = 152/165
= 0.9212$ and $\omega_2 = 0.0022$. Since $\omega_2$ is small, we can consider $-2\rho \log \lambda_1^*$ as
$\chi^2$ with 12 degrees of freedom. Our observed criterion, therefore, is clearly
not significant.
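
The figures quoted in this example can be retraced from the five matrices with formulas (1), (5), and (6) of this section. In the sketch below the variable names are illustrative assumptions, and the two entries 2519.31 and 1473.44 are read so that the five matrices add up to the stated sum; the statistic uses the identity $-\log \lambda_1^* = \frac{1}{2}[n \log|S| - \sum_g n_g \log|S_g|]$, which follows from (1) with $S_g = A_g / n_g$ and $S = A / n$.

```python
import numpy as np

# Sums of squares and cross-products for the five samples (12 observations each).
A = [np.array([[78.948, 214.18], [214.18, 1247.18]]),
     np.array([[223.695, 657.62], [657.62, 2519.31]]),
     np.array([[57.448, 190.63], [190.63, 1241.78]]),
     np.array([[187.618, 375.91], [375.91, 1473.44]]),
     np.array([[88.456, 259.18], [259.18, 1171.73]])]
N_g, p, q = 12, 2, 5
n_g = N_g - 1                  # degrees of freedom per sample
n = q * n_g                    # total degrees of freedom

# -log lambda_1^* = (1/2) [ n log|S| - sum_g n_g log|S_g| ]
S_pooled = sum(A) / n
neg_log_lam = 0.5 * (n * np.linalg.slogdet(S_pooled)[1]
                     - sum(n_g * np.linalg.slogdet(Ag / n_g)[1] for Ag in A))

# Scale factor rho of (5) and coefficient omega_2 of (6), equal sample sizes.
rho = 1 - (q / n_g - 1 / n) * (2 * p**2 + 3 * p - 1) / (6 * (p + 1) * (q - 1))
omega2 = p * (p + 1) * ((p - 1) * (p + 2) * (q / n_g**2 - 1 / n**2)
                        - 6 * (q - 1) * (1 - rho)**2) / (48 * rho**2)

f = (q - 1) * p * (p + 1) // 2          # degrees of freedom from (4)
statistic = 2 * rho * neg_log_lam       # compared with chi-square on f degrees of freedom
```

The resulting statistic is close to 10, well below the upper 5% point of $\chi^2$ with 12 degrees of freedom (about 21.0), in line with the conclusion stated in the text.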

Table B.5 [due to Korin (1969)] gives 5% significance points for $-2 \log \lambda_1^*$
for $N_1 = \cdots = N_q$ for various $q$, small values of $N_g$, and $p = 2(1)6$.

The limiting distribution of the criterion (19) of Section 10.2 is also $\chi_f^2$. An
asymptotic expansion of the distribution was given by Nagao (1973b) to terms
of order $1/n$ involving $\chi^2$-distributions with $f$, $f + 2$, $f + 4$, and $f + 6$
degrees of freedom.

10.6. THE CASE OF TWO POPULATIONS

10.6.1. Invariant Tests

When $q = 2$, the null hypothesis $H_1$ is $\Sigma_1 = \Sigma_2$. It is invariant with respect to
transformations

(1)    $x^{*(1)} = C x^{(1)} + \nu^{(1)}, \qquad x^{*(2)} = C x^{(2)} + \nu^{(2)},$

where $C$ is nonsingular. The maximal invariant of the parameters under the
transformation of locations ($C = I$) is the pair of covariance matrices $\Sigma_1$, $\Sigma_2$,
and the maximal invariant of the sufficient statistics $\bar{x}^{(1)}$, $S_1$, $\bar{x}^{(2)}$, $S_2$ is the
pair of matrices $S_1$, $S_2$ (or equivalently $A_1$, $A_2$). The transformation (1)
induces the transformations $\Sigma_1^* = C \Sigma_1 C'$, $\Sigma_2^* = C \Sigma_2 C'$, $S_1^* = C S_1 C'$, and
$S_2^* = C S_2 C'$. The roots $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$ of

(2)    $|\Sigma_1 - \lambda \Sigma_2| = 0$

are invariant under these transformations since

(3)    $|\Sigma_1^* - \lambda \Sigma_2^*| = |C \Sigma_1 C' - \lambda C \Sigma_2 C'| = |CC'| \cdot |\Sigma_1 - \lambda \Sigma_2|.$

Moreover, the roots are the only invariants because there exists a nonsingular
matrix $C$ such that

(4)    $C \Sigma_1 C' = \Lambda, \qquad C \Sigma_2 C' = I,$

where $\Lambda$ is the diagonal matrix with $\lambda_i$ as the $i$th diagonal element,
$i = 1, \dots, p$. (See Theorem A.2.2 of the Appendix.) Similarly, the maximal
invariants of $S_1$ and $S_2$ are the roots $l_1 \ge l_2 \ge \cdots \ge l_p$ of

(5)    $|S_1 - l S_2| = 0.$

Theorem 10.6.1. The maximal invariant of the parameters of $N(\mu^{(1)}, \Sigma_1)$
and $N(\mu^{(2)}, \Sigma_2)$ under the transformation (1) is the set of roots
$\lambda_1 \ge \cdots \ge \lambda_p$ of (2). The maximal invariant of the sufficient statistics
$\bar{x}^{(1)}$, $S_1$, $\bar{x}^{(2)}$, $S_2$ is the set of roots $l_1 \ge \cdots \ge l_p$ of (5).

Any invariant test criterion can be expressed in terms of the roots
$l_1, \dots, l_p$. The criterion $V_1$ is $n_1^{\frac{1}{2}pn_1} n_2^{\frac{1}{2}pn_2}$ times

(6)    $\frac{|S_1|^{\frac{1}{2}n_1} |S_2|^{\frac{1}{2}n_2}}{|n_1 S_1 + n_2 S_2|^{\frac{1}{2}n}} = \frac{|L|^{\frac{1}{2}n_1} |I|^{\frac{1}{2}n_2}}{|n_1 L + n_2 I|^{\frac{1}{2}n}} = \prod_{i=1}^{p} \frac{l_i^{\frac{1}{2}n_1}}{(n_1 l_i + n_2)^{\frac{1}{2}n}},$

where $L$ is the diagonal matrix with $l_i$ as the $i$th diagonal element. The null
hypothesis is rejected if the smaller roots are too small or if the larger roots
are too large, or both.

The null hypothesis is that $\lambda_1 = \cdots = \lambda_p = 1$. Any useful invariant test of
the null hypothesis has a rejection region in the space of $l_1, \dots, l_p$ that
includes the points that in some sense are far from $l_1 = \cdots = l_p = 1$. The
power of an invariant test depends on the parameters through the roots
$\lambda_1, \dots, \lambda_p$.

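The criterion (19) of Section 10.2 is (with $nS = n_1 S_1 + n_2 S_2$)

(7)    $\frac{1}{2} n_1 \operatorname{tr}[(S_1 - S)S^{-1}]^2 + \frac{1}{2} n_2 \operatorname{tr}[(S_2 - S)S^{-1}]^2$

       $= \frac{1}{2} n_1 \operatorname{tr}[C(S_1 - S)C'(CSC')^{-1}]^2 + \frac{1}{2} n_2 \operatorname{tr}[C(S_2 - S)C'(CSC')^{-1}]^2$

       $= \frac{1}{2} n_1 \operatorname{tr}\left\{ \left[ L - \frac{1}{n}(n_1 L + n_2 I) \right] \left[ \frac{1}{n}(n_1 L + n_2 I) \right]^{-1} \right\}^2 + \frac{1}{2} n_2 \operatorname{tr}\left\{ \left[ I - \frac{1}{n}(n_1 L + n_2 I) \right] \left[ \frac{1}{n}(n_1 L + n_2 I) \right]^{-1} \right\}^2$

       $= \frac{1}{2} n_1 n_2 n \sum_{i=1}^{p} \frac{(l_i - 1)^2}{(n_1 l_i + n_2)^2}.$

This criterion is a measure of how close $l_1, \dots, l_p$ are to 1; the hypothesis is
rejected if the measure is too large. Under the null hypothesis, (7) has the
$\chi^2$-distribution with $f = \frac{1}{2}p(p + 1)$ degrees of freedom as $n_1 \to \infty$, $n_2 \to \infty$,
and $n_1 / n_2$ approaches a positive constant. Nagao (1973b) gives an asymptotic
expansion of this distribution to terms of order $1/n$.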
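
The reduction in (7) from a trace criterion to a sum over the roots of (5) can be checked numerically. In this sketch the simulated matrices, the seed, and the variable names are illustrative assumptions; the roots $l_i$ are obtained as the eigenvalues of $S_2^{-1} S_1$, which are the solutions of $|S_1 - l S_2| = 0$.

```python
import numpy as np

rng = np.random.default_rng(3)
p, n1, n2 = 3, 10, 14

# Positive definite stand-ins for the two sample covariance matrices.
Z1 = rng.standard_normal((n1, p))
S1 = Z1.T @ Z1 / n1
Z2 = rng.standard_normal((n2, p))
S2 = Z2.T @ Z2 / n2

n = n1 + n2
S = (n1 * S1 + n2 * S2) / n
S_inv = np.linalg.inv(S)

# First line of (7): the trace form of the criterion.
D1 = (S1 - S) @ S_inv
D2 = (S2 - S) @ S_inv
direct = 0.5 * n1 * np.trace(D1 @ D1) + 0.5 * n2 * np.trace(D2 @ D2)

# Last line of (7): the same criterion expressed through the roots l_i.
roots = np.linalg.eigvals(np.linalg.inv(S2) @ S1).real
via_roots = 0.5 * n1 * n2 * n * np.sum((roots - 1) ** 2 / (n1 * roots + n2) ** 2)
```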

Roy (1953) suggested a test based on the largest and smallest roots, $l_1$ and
$l_p$. The procedure is to reject the null hypothesis if $l_1 > k_1$ or if $l_p < k_p$,
where $k_1$ and $k_p$ are chosen so that the probability of rejection when $\Lambda = I$
is the desired significance level. Roy (1957) proposed determining $k_1$ and $k_p$
so that the test is locally unbiased, that is, that the power functions have a
relative minimum at $\Lambda = I$. Since it is hard to determine $k_1$ and $k_p$ on this
basis, other proposals have been made. The limit $k_1$ can be determined so
that $\Pr\{l_1 > k_1 \mid H_1\}$ is one-half the significance level, or $\Pr\{l_p < k_p \mid H_1\}$ is
one-half of the significance level, or $k_1 + k_p = 2$, or $k_1 k_p = 1$. In principle $k_1$
and $k_p$ can be determined from the distribution of the roots, given in Section
13.2. Schuurmann, Waikar, and Krishnaiah (1975) and Chu and Pillai (1979)
give some exact values of $k_1$ and $k_p$ for small values of $p$. Chu and Pillai
(1979) also make some power comparisons of several test procedures.

In the case of $p = 1$ the only invariant of the sufficient statistics is $S_1 / S_2$,
which is the usual $F$-statistic with $n_1$ and $n_2$ degrees of freedom. The
criterion $V_1$ is $(A_1 / A_2)^{\frac{1}{2}n_1} [1 + A_1 / A_2]^{-\frac{1}{2}n}$; the critical region $V_1$ less than a
constant is equivalent to a two-tailed critical region for the $F$-statistic. The
quantity $n(B - A)/A$ has an independent $F$-distribution with 1 and $n$ degrees
of freedom. (See Section 10.3.)

In the case of $p = 2$, the $h$th moment of $V_1$ is, from (15) of Section 10.4,

(8)    $\mathcal{E} V_1^h = \frac{\Gamma(n_1 + hn_1 - 1) \, \Gamma(n_2 + hn_2 - 1) \, \Gamma(n - 1)}{\Gamma(n_1 - 1) \, \Gamma(n_2 - 1) \, \Gamma(n + hn - 1)} = \mathcal{E} X_1^{n_1 h} (1 - X_1)^{n_2 h} X_2^{(n_1 + n_2)h},$

where $X_1$ and $X_2$ are independently distributed according to $\beta(x \mid n_1 - 1,
n_2 - 1)$ and $\beta(x \mid n_1 + n_2 - 2, 1)$, respectively. Then $\Pr\{V_1 \le v\}$ can be found by
integration. (See Problems 10.8 and 10.9.)

Anderson (1965a) has shown that a confidence interval for $a' \Sigma_1 a / a' \Sigma_2 a$
for all $a$ with confidence coefficient $\varepsilon$ is given by $(l_p / U, \; l_1 / L)$, where

$\Pr\{(n_2 - p + 1) L \le n_2 F_{n_1, n_2 - p + 1}\} \Pr\{(n_1 - p + 1) F_{n_1 - p + 1, n_2} \le n_1 U\} = 1 - \varepsilon.$

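10.6.2. Components of Variance

In Section 8.8 we considered what is equivalent to the one-way analysis of
variance with fixed effects. We can write the model in the balanced case
($N_1 = N_2 = \cdots = N_q = M$) as

(9)    $X_\alpha^{(g)} = \mu^{(g)} + U_\alpha^{(g)} = \mu + \nu_g + U_\alpha^{(g)}, \qquad \alpha = 1, \dots, M, \quad g = 1, \dots, q,$

where $\mathcal{E} U_\alpha^{(g)} = 0$ and $\mathcal{E} U_\alpha^{(g)} U_\alpha^{(g)\prime} = \Sigma$, $\nu_g = \mu^{(g)} - \mu$, and
$\mu = (1/q) \sum_{g=1}^{q} \mu^{(g)}$ ($\sum_{g=1}^{q} \nu_g = 0$). The null hypothesis of no effect is
$\nu_1 = \cdots = \nu_q = 0$. Let $\bar{x}^{(g)} = (1/M) \sum_{\alpha=1}^{M} x_\alpha^{(g)}$ and
$\bar{x} = (1/q) \sum_{g=1}^{q} \bar{x}^{(g)}$. The analysis of variance table is

Source    Sum of Squares                                                                                          Degrees of Freedom

Effect    $H = M \sum_{g=1}^{q} (\bar{x}^{(g)} - \bar{x})(\bar{x}^{(g)} - \bar{x})'$                               $q - 1$

Error     $G = \sum_{g=1}^{q} \sum_{\alpha=1}^{M} (x_\alpha^{(g)} - \bar{x}^{(g)})(x_\alpha^{(g)} - \bar{x}^{(g)})'$   $q(M - 1)$

Total     $\sum_{g=1}^{q} \sum_{\alpha=1}^{M} (x_\alpha^{(g)} - \bar{x})(x_\alpha^{(g)} - \bar{x})'$                   $qM - 1$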
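
The rows of the table satisfy Total = Effect + Error, for the matrices as well as for the degrees of freedom. A minimal sketch on simulated balanced data; the array layout `X[g, alpha]`, the seed, and the dimensions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
q, M, p = 4, 6, 3
X = rng.standard_normal((q, M, p))          # X[g, alpha] holds the observation x_alpha^(g)

group_means = X.mean(axis=1)                # q x p array of the sample means
grand_mean = group_means.mean(axis=0)       # balanced case: mean of the group means

# Effect matrix H = M * sum_g (xbar_g - xbar)(xbar_g - xbar)'
dev_h = group_means - grand_mean
H = M * dev_h.T @ dev_h

# Error matrix G = sum_g sum_alpha (x - xbar_g)(x - xbar_g)'
dev_g = (X - group_means[:, None, :]).reshape(-1, p)
G = dev_g.T @ dev_g

# Total matrix about the grand mean
dev_t = (X - grand_mean).reshape(-1, p)
T = dev_t.T @ dev_t

df_effect, df_error, df_total = q - 1, q * (M - 1), q * M - 1
```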

Invariant tests of the null hypothesis of no effect are based on the roots of
$|H - mG| = 0$ or of $|S_h - l S_e| = 0$, where $S_h = [1/(q - 1)]H$ and
$S_e = [1/q(M - 1)]G$. The null hypothesis is rejected if one or more of the roots is
too large. The error matrix $G$ has the distribution $W(\Sigma, q(M - 1))$. The
effects matrix $H$ has the distribution $W(\Sigma, q - 1)$ when the null hypothesis is
true and has the noncentral Wishart distribution when the null hypothesis is
not true; its expected value is

(10)    $\mathcal{E} H = (q - 1)\Sigma + M \sum_{g=1}^{q} (\mu^{(g)} - \mu)(\mu^{(g)} - \mu)' = (q - 1)\Sigma + M \sum_{g=1}^{q} \nu_g \nu_g'.$

The MANOVA model with random effects is

(11)    $X_\alpha^{(g)} = \mu + V_g + U_\alpha^{(g)}, \qquad \alpha = 1, \dots, M, \quad g = 1, \dots, q,$

where $V_g$ has the distribution $N(0, \Theta)$. Then $X_\alpha^{(g)}$ has the distribution
$N(\mu, \Sigma + \Theta)$. The null hypothesis of no effect is

(12)    $\Theta = 0.$

In this model $G$ again has the distribution $W(\Sigma, q(M - 1))$. Since
$\bar{X}^{(g)} = \mu + V_g + \bar{U}^{(g)}$ has the distribution $N(\mu, (1/M)\Sigma + \Theta)$, $H$ has the distribution
$W(\Sigma + M\Theta, q - 1)$. The null hypothesis (12) is equivalent to the equality of
the covariance matrices in these two Wishart distributions; that is, $\Sigma = \Sigma + M\Theta$.
The matrices $G$ and $H$ correspond to $A_1$ and $A_2$ in Section 10.6.1.
However, here the alternative to the null hypothesis is that $(\Sigma + M\Theta) - \Sigma$ is
positive semidefinite, rather than $\Sigma_1 \ne \Sigma_2$. The null hypothesis is to be
rejected if $H$ is too large relative to $G$. Any of the criteria presented in
Section 10.2 can be used to test the null hypothesis here, and its distribution
under the null hypothesis is the same as given there.

The likelihood ratio criterion for testing $\Theta = 0$ must take into account the
fact that $\Theta$ is positive semidefinite; that is, the maximum likelihood
estimators of $\Sigma$ and $\Sigma + M\Theta$ under $\Omega$ must be such that the estimator of $\Theta$ is
positive semidefinite. Let $l_1 > l_2 > \cdots > l_p$ be the roots of

(13)    $\left| H - l \, \frac{1}{M - 1} G \right| = 0.$

(Note $\{1/[q(M - 1)]\}G$ and $(1/q)H$ maximize the likelihood without regard
to $\Theta$ being positive definite.) Let $l_i^* = l_i$ if $l_i > 1$, and let $l_i^* = 1$ if $l_i \le 1$.
Then the likelihood ratio criterion for testing the hypothesis $\Theta = 0$ against
the alternative $\Theta$ positive semidefinite and $\Theta \ne 0$ is

(14)    $M^{\frac{1}{2}qMp} \prod_{i=1}^{p} \frac{l_i^{* \frac{1}{2}q}}{(l_i^* + M - 1)^{\frac{1}{2}qM}} = M^{\frac{1}{2}qMk} \prod_{i=1}^{k} \frac{l_i^{\frac{1}{2}q}}{(l_i + M - 1)^{\frac{1}{2}qM}},$

where $k$ is the number of roots of (13) greater than 1. [See Anderson (1946b),
(1984a), (1989a), Morris and Olkin (1964), and Klotz and Putter (1969).]


10.7. TESTING THE HYPOTHESIS THAT A COVARIANCE MATRIX IS
PROPORTIONAL TO A GIVEN MATRIX; THE SPHERICITY TEST

10.7.1. The Hypothesis

In many statistical analyses that are considered univariate, the assumption is
made that a set of random variables are independent and have a common
variance. In this section we consider a test of these assumptions based on
repeated sets of observations.

More precisely, we use a sample of $p$-component vectors $x_1, \dots, x_N$ from
$N(\mu, \Sigma)$ to test the hypothesis $H: \Sigma = \sigma^2 I$, where $\sigma^2$ is not specified. The
hypothesis can be given an algebraic interpretation in terms of the
characteristic roots of $\Sigma$, that is, the roots of

(1)    $|\Sigma - \phi I| = 0.$

The hypothesis is true if and only if all the roots of (1) are equal.† Another
way of putting it is that the arithmetic mean of roots $\phi_1, \dots, \phi_p$ is equal to
the geometric mean, that is,

(2)    $\frac{\prod_{i=1}^{p} \phi_i^{1/p}}{\sum_{i=1}^{p} \phi_i / p} = \frac{|\Sigma|^{1/p}}{\operatorname{tr} \Sigma / p} = 1.$

The lengths squared of the principal axes of the ellipsoids of constant density
are proportional to the roots $\phi_i$ (see Chapter 11); the hypothesis specifies
that these are equal, that is, that the ellipsoids are spheres.

The hypothesis $H$ is equivalent to the more general form $\Psi = \sigma^2 \Psi_0$, with
$\Psi_0$ specified, having observation vectors $y_1, \dots, y_N$ from $N(\nu, \Psi)$. Let $C$ be
a matrix such that

(3)    $C \Psi_0 C' = I,$

and let $\mu^* = C\nu$, $\Sigma^* = C \Psi C'$, $x_\alpha^* = C y_\alpha$. Then $x_1^*, \dots, x_N^*$ are observations
from $N(\mu^*, \Sigma^*)$, and the hypothesis is transformed into $H: \Sigma^* = \sigma^2 I$.

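10.7.2. The Criterion

In the canonical form the hypothesis $H$ is a combination of the hypothesis
$H_1$: $\Sigma$ is diagonal or the components of $X$ are independent and $H_2$: the
diagonal elements of $\Sigma$ are equal given that $\Sigma$ is diagonal or the variances of
the components of $X$ are equal given that the components are independent.
Thus by Lemma 10.3.1 the likelihood ratio criterion $\lambda$ for $H$ is the product of
the criterion $\lambda_1$ for $H_1$ and $\lambda_2$ for $H_2$. From Section 9.2 we see that the
criterion for $H_1$ is

(4)    $\lambda_1 = \frac{|A|^{\frac{1}{2}N}}{\prod_{i=1}^{p} a_{ii}^{\frac{1}{2}N}} = |r_{ij}|^{\frac{1}{2}N},$

where

(5)    $A = \sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})' = (a_{ij})$

and $r_{ij} = a_{ij} / \sqrt{a_{ii} a_{jj}}$. We use the results of Section 10.2 to obtain $\lambda_2$ by
considering the $i$th component of $x_\alpha$ as the $\alpha$th observation from the $i$th
population. ($p$ here is $q$ in Section 10.2; $N$ here is $N_g$ there; $pN$ here is $N$
there.) Thus

† This follows from the fact that $\Sigma = O' \Phi O$, where $\Phi$ is a diagonal matrix with roots as diagonal
elements and $O$ is an orthogonal matrix.

(6)    $\lambda_2 = \frac{\prod_{i=1}^{p} \left[ \sum_{\alpha} (x_{i\alpha} - \bar{x}_i)^2 \right]^{\frac{1}{2}N}}{\left[ \sum_{i=1}^{p} \sum_{\alpha} (x_{i\alpha} - \bar{x}_i)^2 / p \right]^{\frac{1}{2}pN}} = \frac{\prod_{i=1}^{p} a_{ii}^{\frac{1}{2}N}}{(\operatorname{tr} A / p)^{\frac{1}{2}pN}}.$

Thus the criterion for $H$ is

(7)    $\lambda = \lambda_1 \lambda_2 = \frac{|A|^{\frac{1}{2}N}}{(\operatorname{tr} A / p)^{\frac{1}{2}pN}}.$

It will be observed that $\lambda$ resembles (2). If $l_1, \dots, l_p$ are the roots of

(8)    $|S - lI| = 0,$

where $S = (1/n)A$, the criterion is a power of the ratio of the geometric
mean to the arithmetic mean,

(9)    $\lambda = \left( \frac{\prod_{i=1}^{p} l_i^{1/p}}{\sum_{i=1}^{p} l_i / p} \right)^{\frac{1}{2}pN}.$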

Now let us go back to the hypothesis 屯 = o * 2 平 0 ， given observation 
vectors from N(v y ^), In the transformed variables {x*} the 

criterion is |^4 + | ^(tr ^/pY、 pN 、 where 

(10) 

N 

/l* = D (x*-x*){x* -X*)' 

a— l 


N 

=c E (y a -y)(y a -~y)'c r 

a = 1 


= CBC', 

where 


(li) 

B- L(y a -~y)(y a -yY. 

a^l 
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From (3) we have = C ~ 1 {C r y 1 = (C^C)" 1 . Thus 

(12) tr A* = \xCBC ^WBCC 

The results can be summarized. 

Theorem 10.7.1. Given a set of p-component observation vectors
$y_1, \ldots, y_N$ from $N(\nu, \Psi)$, the likelihood ratio criterion for
testing the hypothesis $H$: $\Psi = \sigma^2\Psi_0$, where $\Psi_0$ is
specified and $\sigma^2$ is not specified, is

(13)   $\dfrac{|B\Psi_0^{-1}|^{N/2}}{(\operatorname{tr} B\Psi_0^{-1}/p)^{pN/2}}.$

Mauchly (1940) gave this criterion and its moments under the null
hypothesis.

The maximum likelihood estimator of $\sigma^2$ under the null hypothesis is
$\operatorname{tr} B\Psi_0^{-1}/(pN)$, which is $\operatorname{tr} A/(pN)$
in canonical form; an unbiased estimator is
$\operatorname{tr} B\Psi_0^{-1}/[p(N-1)]$, or $\operatorname{tr} A/[p(N-1)]$
in canonical form [Hotelling (1951)]. Then
$\operatorname{tr} B\Psi_0^{-1}/\sigma^2$ has the $\chi^2$-distribution with
$p(N-1)$ degrees of freedom.
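As an illustration, the criterion (13) and the two estimators of $\sigma^2$ can be computed numerically. The following is a minimal sketch, assuming NumPy is available and that the observations are stored as the rows of an array; the function name `sphericity_statistic` is hypothetical and not part of the text. It returns $W = \lambda^{2/N} = |B\Psi_0^{-1}| / (\operatorname{tr} B\Psi_0^{-1}/p)^p$ rather than $\lambda$ itself.

```python
import numpy as np

def sphericity_statistic(Y, Psi0=None):
    """Return (W, sigma2_mle, sigma2_unbiased) for H: Psi = sigma^2 * Psi0.

    W = |B Psi0^{-1}| / (tr(B Psi0^{-1}) / p)^p is the (2/N)th power of (13).
    Y holds one p-component observation per row (an assumed layout).
    """
    Y = np.asarray(Y, dtype=float)
    N, p = Y.shape
    Psi0 = np.eye(p) if Psi0 is None else np.asarray(Psi0, dtype=float)
    D = Y - Y.mean(axis=0)              # deviations from the sample mean
    B = D.T @ D                         # matrix B of (11)
    # Psi0^{-1} B has the same trace and determinant as B Psi0^{-1}.
    M = np.linalg.solve(Psi0, B)
    tr = np.trace(M)
    _, logdet = np.linalg.slogdet(M)
    W = np.exp(logdet - p * np.log(tr / p))
    return W, tr / (p * N), tr / (p * (N - 1))
```

Because the criterion is a ratio of a geometric to an arithmetic mean of roots, $W$ lies between 0 and 1, and it is unchanged when $\Psi_0$ is multiplied by a positive constant.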


10.7.3. The Distribution and Moments of the Criterion 

The distribution of the likelihood ratio criterion under the null hypothesis
can be characterized by the facts that $\lambda = \lambda_1\lambda_2$ and
$\lambda_1$ and $\lambda_2$ are independent and by the characterizations of
$\lambda_1$ and $\lambda_2$. As was observed in Section 7.6, when $\Sigma$ is
diagonal the correlation coefficients $\{r_{ij}\}$ are distributed
independently of the variances $\{a_{ii}/(N-1)\}$. Since $\lambda_1$ depends
only on $\{r_{ij}\}$ and $\lambda_2$ depends only on $\{a_{ii}\}$, they are
independently distributed when the null hypothesis is true. Let
$W = \lambda^{2/N}$, $W_1 = \lambda_1^{2/N}$, $W_2 = \lambda_2^{2/N}$. From
Theorem 9.3.3 we see that $W_1$ is distributed as $\prod_{i=2}^{p} X_i$,
where $X_2, \ldots, X_p$ are independent and $X_i$ has the density
$\beta[x \mid \tfrac12(n-i+1), \tfrac12(i-1)]$, where $n = N - 1$. From
Theorem 10.4.2 with $W_2 = p^p V_1^{2/n}$, we find that $W_2$ is distributed
as $p^p \prod_{j=2}^{p} Y_j^{\,j-1}(1 - Y_j)$, where $Y_2, \ldots, Y_p$ are
independent and $Y_j$ has the density
$\beta[y \mid \tfrac12 n(j-1), \tfrac12 n]$. Then $W$ is distributed as
$W_1W_2$, where $W_1$ and $W_2$ are independent.
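The moments of $W$ can be found from this characterization or from
Theorems 9.3.4 and 10.4.4. We have

(14)   $\mathcal{E} W_1^h = \dfrac{\Gamma^p(\tfrac12 n)\, \Gamma_p(\tfrac12 n + h)}{\Gamma^p(\tfrac12 n + h)\, \Gamma_p(\tfrac12 n)},$

(15)   $\mathcal{E} W_2^h = p^{hp}\, \dfrac{\Gamma(\tfrac12 pn)\, \Gamma^p(\tfrac12 n + h)}{\Gamma(\tfrac12 pn + ph)\, \Gamma^p(\tfrac12 n)}.$

It follows that

(16)   $\mathcal{E} W^h = p^{hp}\, \dfrac{\Gamma(\tfrac12 pn)\, \Gamma_p(\tfrac12 n + h)}{\Gamma(\tfrac12 pn + ph)\, \Gamma_p(\tfrac12 n)}.$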


For $p = 2$ we have

(17)   $\mathcal{E} W^h = 4^h\, \dfrac{\Gamma(n)}{\Gamma(n + 2h)} \prod_{i=1}^{2} \dfrac{\Gamma[\tfrac12(n+1-i) + h]}{\Gamma[\tfrac12(n+1-i)]}
        = \dfrac{\Gamma(n)\,\Gamma(n-1+2h)}{\Gamma(n+2h)\,\Gamma(n-1)}
        = \dfrac{n-1}{n-1+2h}
        = (n-1)\int_0^1 z^{\,n-2+2h}\, dz,$

by use of the duplication formula for the gamma function. Thus $W$ is
distributed as $Z^2$, where $Z$ has the density $(n-1)z^{n-2}$, and $W$ has
the density $\tfrac12(n-1)w^{\frac12(n-3)}$. The cdf is

(18)   $\Pr\{W \le w\} = F(w) = w^{\frac12(n-1)}.$

This result can also be found from the joint distribution of $l_1, l_2$, the
roots of (8). The density for $p = 3$, 4, and 6 has been obtained by Consul
(1967b). See also Pillai and Nagarsenker (1971).
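The reduction of the moment formula (16) to $(n-1)/(n-1+2h)$ for $p = 2$ can be checked numerically. The sketch below is only an illustration: it assumes the usual multivariate gamma function $\Gamma_p(a) = \pi^{p(p-1)/4}\prod_{i=1}^{p}\Gamma[a - \tfrac12(i-1)]$ and works on the logarithmic scale with the standard library function `math.lgamma`; the names `log_multigamma` and `moment_W` are hypothetical.

```python
from math import exp, lgamma, log, pi

def log_multigamma(a, p):
    """log Gamma_p(a) = (p(p-1)/4) log(pi) + sum_{i=1}^p log Gamma(a - (i-1)/2)."""
    return 0.25 * p * (p - 1) * log(pi) + sum(lgamma(a - 0.5 * i) for i in range(p))

def moment_W(h, n, p):
    """hth moment of W under the null hypothesis, following (16)."""
    log_m = (h * p * log(p)
             + lgamma(0.5 * p * n) - lgamma(0.5 * p * n + p * h)
             + log_multigamma(0.5 * n + h, p) - log_multigamma(0.5 * n, p))
    return exp(log_m)
```

For example, `moment_W(1.5, 10, 2)` should agree with $(n-1)/(n-1+2h) = 9/12$ up to rounding error.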


10.7.4. Asymptotic Expansion of the Distribution 

From (16) we see that the $r$th moment of $W^{\frac12 n} = Z$, say, is

(19)   $\mathcal{E} Z^r = K p^{\frac12 npr}\, \dfrac{\prod_{i=1}^{p} \Gamma[\tfrac12 n(1+r) + \tfrac12(1-i)]}{\Gamma[\tfrac12 pn(1+r)]}.$

This is of the form of (1), Section 8.5, with

(20)   $a = p, \quad x_k = \tfrac12 n, \quad \xi_k = \tfrac12(1-k), \quad k = 1, \ldots, p,$
       $b = 1, \quad y_1 = \tfrac12 np, \quad \eta_1 = 0.$

Thus the expansion of Section 8.5 is valid with $f = \tfrac12 p(p+1) - 1$. To
make the second term in the expansion zero we take $\rho$ so

(21)   $1 - \rho = \dfrac{2p^2 + p + 2}{6pn}.$

Then

(22)   $\omega_2 = \dfrac{(p+2)(p-1)(p-2)(2p^3 + 6p^2 + 3p + 2)}{288\, p^2 n^2 \rho^2}.$

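Thus the cdf of $W$ is found from

(23)   $\Pr\{-2\rho \log Z \le z\} = \Pr\{-n\rho \log W \le z\}$
       $= \Pr\{\chi_f^2 \le z\} + \omega_2\left(\Pr\{\chi_{f+4}^2 \le z\} - \Pr\{\chi_f^2 \le z\}\right) + O(n^{-3}).$

Factors $c(n, p, \varepsilon)$ have been tabulated in Table B.6 such that

(24)   $\Pr\{-n\rho \log W \ge c(n, p, \varepsilon)\, \chi^2_{\frac12 p(p+1)-1}(\varepsilon)\} = \varepsilon.$

Nagarsenker and Pillai (1973a) have tables for $W$.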
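The two-term approximation to the cdf of $-n\rho \log W$ can be evaluated directly. The sketch below assumes the constants $f = \tfrac12 p(p+1) - 1$, $1 - \rho = (2p^2 + p + 2)/(6pn)$, and $\omega_2 = (p+2)(p-1)(p-2)(2p^3+6p^2+3p+2)/(288p^2n^2\rho^2)$; the function names are hypothetical, and the $\chi^2$ cdf is computed by a simple power series that is adequate only for moderate arguments. For $p = 2$ the factor $p - 2$ makes $\omega_2 = 0$ and $n\rho = n - 1$, so the approximation reduces to $1 - e^{-z/2}$, which agrees exactly with the cdf $w^{\frac12(n-1)}$ of $W$.

```python
from math import exp, lgamma, log

def chi2_cdf(z, df):
    """Chi-square cdf via the power series of the regularized incomplete gamma."""
    if z <= 0.0:
        return 0.0
    a, x = 0.5 * df, 0.5 * z
    term = exp(a * log(x) - x - lgamma(a + 1.0))   # x^a e^{-x} / Gamma(a + 1)
    total = term
    k = 1
    while term > 1e-17 * total:
        term *= x / (a + k)
        total += term
        k += 1
    return min(total, 1.0)

def sphericity_constants(n, p):
    """Return (f, rho, omega2) for the expansion of the sphericity criterion."""
    f = p * (p + 1) // 2 - 1
    rho = 1.0 - (2 * p**2 + p + 2) / (6.0 * p * n)
    omega2 = ((p + 2) * (p - 1) * (p - 2) * (2 * p**3 + 6 * p**2 + 3 * p + 2)
              / (288.0 * p**2 * n**2 * rho**2))
    return f, rho, omega2

def sphericity_cdf(z, n, p):
    """Approximate Pr{-n rho log W <= z}, omitting the O(n^-3) remainder."""
    f, _, omega2 = sphericity_constants(n, p)
    base = chi2_cdf(z, f)
    return base + omega2 * (chi2_cdf(z, f + 4) - base)
```

An observed value of $W$ would be converted to $z = -n\rho\log W$ before calling `sphericity_cdf`; one minus the result approximates the significance probability.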


10.7.5. Invariant Tests

The null hypothesis $H$: $\Sigma = \sigma^2 I$ is invariant with respect to
transformations $X^* = cQX + \nu$, where $c$ is a scalar and $Q$ is an
orthogonal matrix. The invariant of the sufficient statistic under shift of
location is $A$, the invariants of $A$ under orthogonal transformations are
the characteristic roots $l_1, \ldots, l_p$, and the invariants of the roots
under scale transformations are functions that are homogeneous of degree 0,
such as the ratios of roots, say $l_1/l_2, \ldots, l_{p-1}/l_p$. Invariant
tests are based on such functions; the likelihood ratio criterion is such a
function.





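Nagao (1973a) proposed the criterion

(25)   $\tfrac12 n \operatorname{tr}\left(S - \dfrac{\operatorname{tr} S}{p} I\right)^{2} \left(\dfrac{p}{\operatorname{tr} S}\right)^{2}
        = \tfrac12 n \left[\dfrac{p^2 \operatorname{tr} S^2}{(\operatorname{tr} S)^2} - p\right]
        = \tfrac12 n\, \dfrac{\sum_{i=1}^{p} (l_i - \bar{l})^2}{\bar{l}^{\,2}},$

where $\bar{l} = \sum_{i=1}^{p} l_i/p$. The left-hand side of (25) is based on
the loss function $L_q(\Sigma, G)$ of Section 7.8; the right-hand side shows
it is proportional to the square of the coefficient of variation of the
characteristic roots of the sample covariance matrix $S$. Another criterion
is $l_1/l_p$. Percentage points have been given by Krishnaiah and Schuurmann
(1974).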
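The equality of the trace form and the root form of Nagao's criterion is easy to confirm numerically. The sketch below assumes NumPy and takes the criterion to be $\tfrac12 n[p^2 \operatorname{tr} S^2/(\operatorname{tr} S)^2 - p] = \tfrac12 n \sum_i (l_i - \bar{l})^2/\bar{l}^2$; the function name `nagao_sphericity` is hypothetical.

```python
import numpy as np

def nagao_sphericity(S, n):
    """Return Nagao's sphericity criterion in trace form and in root form."""
    S = np.asarray(S, dtype=float)
    p = S.shape[0]
    trace_form = 0.5 * n * (p**2 * np.trace(S @ S) / np.trace(S)**2 - p)
    roots = np.linalg.eigvalsh(S)            # characteristic roots of S
    mean_root = roots.mean()
    root_form = 0.5 * n * np.sum((roots - mean_root)**2) / mean_root**2
    return trace_form, root_form
```

For $S = \operatorname{diag}(1, 2, 3)$ and $n = 10$ both forms give 2.5, and both vanish when $S$ is a multiple of $I$.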




10.7.6. Confidence Regions

Given observations $y_1, \ldots, y_N$ from $N(\nu, \Psi)$, we can test
$\Psi = \sigma^2\Psi_0$ for any specified $\Psi_0$. From this family of tests
we can set up a confidence region for $\Psi$. If any matrix is in the
confidence region, all multiples of it are. This kind of confidence region is
of interest if all components of $y_\alpha$ are measured in the same unit,
but the investigator wants a region independent of this common unit. The
confidence region of confidence $1 - \varepsilon$ consists of all matrices
$\Psi^*$ satisfying

(26)   $\dfrac{|B\Psi^{*-1}|^{N/2}}{(\operatorname{tr} B\Psi^{*-1}/p)^{pN/2}} \ge \lambda(\varepsilon),$

where $\lambda(\varepsilon)$ is the $\varepsilon$ significance level for the
criterion.

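Consider the case of $p = 2$. If the common unit of measurement is
irrelevant, the investigator is interested in $\tau = \psi_{11}/\psi_{22}$
and $\rho = \psi_{12}/\sqrt{\psi_{11}\psi_{22}}$. In this case

(27)   $\Psi^{-1} = \dfrac{1}{\psi_{11}\psi_{22}(1 - \rho^2)}
        \begin{pmatrix} \psi_{22} & -\rho\sqrt{\psi_{11}\psi_{22}} \\ -\rho\sqrt{\psi_{11}\psi_{22}} & \psi_{11} \end{pmatrix}
        = \dfrac{1}{\psi_{11}(1 - \rho^2)}
        \begin{pmatrix} 1 & -\rho\sqrt{\tau} \\ -\rho\sqrt{\tau} & \tau \end{pmatrix}.$

The region in terms of $\tau$ and $\rho$ is

(28)   $\dfrac{4(b_{11}b_{22} - b_{12}^2)\,\tau(1 - \rho^2)}{(b_{11} + \tau b_{22} - 2\rho\sqrt{\tau}\, b_{12})^2} \ge \lambda^{2/N}(\varepsilon).$

Hickman (1953) has given an example of such a confidence region.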
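For $p = 2$ the region can be checked point by point, because the null distribution of $W = \lambda^{2/N}$ is known exactly: $\Pr\{W \le w\} = w^{\frac12(n-1)}$ with $n = N - 1$, so the $\varepsilon$ significance point of $W$ is $\varepsilon^{2/(n-1)}$. The sketch below is an illustration under that assumption; the function name `in_confidence_region` and the argument layout are hypothetical. A pair $(\tau, \rho)$ belongs to the region when $4(b_{11}b_{22} - b_{12}^2)\tau(1-\rho^2)/(b_{11} + \tau b_{22} - 2\rho\sqrt{\tau}\,b_{12})^2$ is at least that significance point.

```python
from math import sqrt

def in_confidence_region(B, N, tau, rho, eps):
    """Check whether (tau, rho) lies in the 1 - eps confidence region for p = 2.

    B is the 2 x 2 matrix of sums of squares and cross products of deviations,
    N the sample size, tau = psi11/psi22 > 0, and rho the correlation (|rho| < 1).
    """
    b11, b12, b22 = B[0][0], B[0][1], B[1][1]
    n = N - 1
    w_eps = eps ** (2.0 / (n - 1))        # eps point of W from the exact cdf
    num = 4.0 * (b11 * b22 - b12**2) * tau * (1.0 - rho**2)
    den = (b11 + tau * b22 - 2.0 * rho * sqrt(tau) * b12) ** 2
    return num / den >= w_eps
```

The test depends on $B$ only up to a positive multiple, which reflects the irrelevance of the common unit of measurement.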

10.8. TESTING THE HYPOTHESIS THAT A COVARIANCE 
MATRIX IS EQUAL TO A GIVEN MATRIX 

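10.8.1. The Criteria

If $Y$ is distributed according to $N(\nu, \Psi)$, we wish to test $H_1$ that
$\Psi = \Psi_0$, where $\Psi_0$ is a given positive definite matrix. By the
argument of the preceding section we see that this is equivalent to testing
the hypothesis $H_1$: $\Sigma = I$, where $\Sigma$ is the covariance matrix
of a vector $X$ distributed according to $N(\mu, \Sigma)$. Given a sample
$x_1, \ldots, x_N$, the likelihood ratio criterion is

(1)   $\lambda_1 = \dfrac{\max_{\mu} L(\mu, I)}{\max_{\mu, \Sigma} L(\mu, \Sigma)},$

where the likelihood function is

(2)   $L(\mu, \Sigma) = (2\pi)^{-\frac12 pN} |\Sigma|^{-\frac12 N} \exp\left[-\tfrac12 \sum_{\alpha=1}^{N} (x_\alpha - \mu)'\Sigma^{-1}(x_\alpha - \mu)\right].$

Results in Chapter 3 show that

(3)   $\lambda_1 = \dfrac{(2\pi)^{-\frac12 pN} \exp\left[-\tfrac12 \sum_{\alpha=1}^{N} (x_\alpha - \bar{x})'(x_\alpha - \bar{x})\right]}{(2\pi)^{-\frac12 pN}\, |(1/N)A|^{-\frac12 N}\, e^{-\frac12 pN}}
       = \left(\dfrac{e}{N}\right)^{\frac12 pN} |A|^{\frac12 N} e^{-\frac12 \operatorname{tr} A},$

where

(4)   $A = \sum_{\alpha=1}^{N} (x_\alpha - \bar{x})(x_\alpha - \bar{x})'.$

Sugiura and Nagao (1968) have shown that the likelihood ratio test is biased,
but the modified likelihood ratio test based on

(5)   $\lambda_1^* = \left(\dfrac{e}{n}\right)^{\frac12 pn} |A|^{\frac12 n} e^{-\frac12 \operatorname{tr} A} = e^{\frac12 pn} \left(|S|\, e^{-\operatorname{tr} S}\right)^{\frac12 n},$

where $S = (1/n)A$, is unbiased. Note that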

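(6)   $-\dfrac{2}{n} \log \lambda_1^* = \operatorname{tr} S - \log|S| - p = L_l(I, S),$

where $L_l(I, S)$ is the loss function for estimating $I$ by $S$ defined in
(2) of Section 7.8. In terms of the characteristic roots of $S$ the criterion
(6) is a constant plus

(7)   $\sum_{i=1}^{p} l_i - \log \prod_{i=1}^{p} l_i - p = \sum_{i=1}^{p} (l_i - \log l_i - 1);$

for each $i$ the minimum of (7) is at $l_i = 1$.

Using the algebra of the preceding section, we see that given
$y_1, \ldots, y_N$ as observation vectors of $p$ components from
$N(\nu, \Psi)$, the modified likelihood ratio criterion for testing the
hypothesis $H_1$: $\Psi = \Psi_0$, where $\Psi_0$ is specified, is

(8)   $\lambda_1^* = \left(\dfrac{e}{n}\right)^{\frac12 pn} |B\Psi_0^{-1}|^{\frac12 n} e^{-\frac12 \operatorname{tr} B\Psi_0^{-1}},$

where

(9)   $B = \sum_{\alpha=1}^{N} (y_\alpha - \bar{y})(y_\alpha - \bar{y})'.$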
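In computation it is convenient to work with $-2\log\lambda_1^*$, which by the algebra above equals $n(\operatorname{tr} \tilde{S} - \log|\tilde{S}| - p)$ with $\tilde{S} = (1/n)B\Psi_0^{-1}$, or equivalently $n\sum_i (l_i - \log l_i - 1)$ in terms of the characteristic roots $l_i$ of $\tilde{S}$. The following sketch assumes NumPy and observations stored as rows; the function name `modified_lr_statistic` is hypothetical.

```python
import numpy as np

def modified_lr_statistic(Y, Psi0):
    """Return -2 log(lambda_1^*) computed from traces and from the roots."""
    Y = np.asarray(Y, dtype=float)
    Psi0 = np.asarray(Psi0, dtype=float)
    N, p = Y.shape
    n = N - 1
    D = Y - Y.mean(axis=0)
    B = D.T @ D
    # Psi0^{-1} B / n shares trace, determinant, and roots with (1/n) B Psi0^{-1}.
    M = np.linalg.solve(Psi0, B) / n
    _, logdet = np.linalg.slogdet(M)
    stat_trace = n * (np.trace(M) - logdet - p)
    roots = np.linalg.eigvals(M).real       # roots are real and positive
    stat_roots = n * np.sum(roots - np.log(roots) - 1.0)
    return stat_trace, stat_roots
```

Each term $l_i - \log l_i - 1$ is nonnegative and vanishes only at $l_i = 1$, so the statistic is zero exactly when the sample covariance matrix coincides with $\Psi_0$.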

10.8.2. The Distribution and Moments of the Modified Likelihood 
Ratio Criterion 

The null hypothesis $H_1$: $\Sigma = I$ is the intersection of the null
hypothesis of Section 10.7, $H$: $\Sigma = \sigma^2 I$, and the null
hypothesis $\sigma^2 = 1$ given $\Sigma = \sigma^2 I$. The likelihood ratio
criterion for $H_1$ given by (3) is the product of (7) of Section 10.7 and

(10)   $\left(\dfrac{\operatorname{tr} A}{pN}\right)^{\frac12 pN} e^{-\frac12 \operatorname{tr} A + \frac12 pN},$

which is the likelihood ratio criterion for testing the hypothesis
$\sigma^2 = 1$ given $\Sigma = \sigma^2 I$. The modified criterion
$\lambda_1^*$ is the product of
$|A|^{\frac12 n}/(\operatorname{tr} A/p)^{\frac12 pn}$ and

(11)   $\left(\dfrac{\operatorname{tr} A}{pn}\right)^{\frac12 pn} e^{-\frac12 \operatorname{tr} A + \frac12 pn};$

these two factors are independent (Lemma 10.4.1). The characterization of
the distribution of the modified criterion can be obtained from Section
10.7.3. The quantity $\operatorname{tr} A$ has the $\chi^2$-distribution
with $np$ degrees of freedom under the null hypothesis.

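Instead of obtaining the moments and characteristic function of
$\lambda_1^*$ [defined by (5)] from the preceding characterization, we shall
find them by use of the fact that $A$ has the distribution $W(\Sigma, n)$.
We shall calculate

(12)   $\mathcal{E} \lambda_1^{*h} = \left(\dfrac{e}{n}\right)^{\frac12 pnh} \int \cdots \int |A|^{\frac12 nh} e^{-\frac12 h \operatorname{tr} A}\, w(A \mid \Sigma, n)\, dA.$

Since

(13)   $|A|^{\frac12 nh} e^{-\frac12 h \operatorname{tr} A}\, w(A \mid \Sigma, n)
        = \dfrac{|A|^{\frac12(n + nh - p - 1)} e^{-\frac12 \operatorname{tr}(\Sigma^{-1} + hI)A}}{2^{\frac12 pn} |\Sigma|^{\frac12 n}\, \Gamma_p(\tfrac12 n)}$
       $= \dfrac{2^{\frac12 pnh}\, \Gamma_p[\tfrac12 n(1+h)]}{|\Sigma^{-1} + hI|^{\frac12(n+nh)} |\Sigma|^{\frac12 n}\, \Gamma_p(\tfrac12 n)}
        \cdot \dfrac{|\Sigma^{-1} + hI|^{\frac12(n+nh)} |A|^{\frac12(n+nh-p-1)} e^{-\frac12 \operatorname{tr}(\Sigma^{-1} + hI)A}}{2^{\frac12 p(n+nh)}\, \Gamma_p[\tfrac12 n(1+h)]}$
       $= \dfrac{2^{\frac12 pnh} |\Sigma|^{\frac12 nh}\, \Gamma_p[\tfrac12 n(1+h)]}{|I + h\Sigma|^{\frac12(n+nh)}\, \Gamma_p(\tfrac12 n)}\;
        w\left(A \mid (\Sigma^{-1} + hI)^{-1},\, n + nh\right),$

the $h$th moment of $\lambda_1^*$ is

(14)   $\mathcal{E} \lambda_1^{*h} = \left(\dfrac{2e}{n}\right)^{\frac12 pnh} \dfrac{|\Sigma|^{\frac12 nh}}{|I + h\Sigma|^{\frac12(n+nh)}} \prod_{j=1}^{p} \dfrac{\Gamma[\tfrac12(n + nh + 1 - j)]}{\Gamma[\tfrac12(n + 1 - j)]}.$

Then the characteristic function of $-2\log\lambda_1^*$ is

(15)   $\mathcal{E} e^{-2it\log\lambda_1^*} = \mathcal{E} \lambda_1^{*\,-2it}
        = \left(\dfrac{2e}{n}\right)^{-ipnt} \dfrac{|\Sigma|^{-int}}{|I - 2it\Sigma|^{\frac12 n - int}} \prod_{j=1}^{p} \dfrac{\Gamma[\tfrac12(n + 1 - j) - int]}{\Gamma[\tfrac12(n + 1 - j)]}.$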
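When the null hypothesis is true, $\Sigma = I$, and

(16)   $\mathcal{E} e^{-2it\log\lambda_1^*} = \left(\dfrac{2e}{n}\right)^{-ipnt} (1 - 2it)^{-\frac12 p(n - 2int)} \prod_{j=1}^{p} \dfrac{\Gamma[\tfrac12(n + 1 - j) - int]}{\Gamma[\tfrac12(n + 1 - j)]}.$

This characteristic function is the product of $p$ terms such as

(17)   $\phi_j(t) = \left(\dfrac{2e}{n}\right)^{-int} (1 - 2it)^{-\frac12(n - 2int)}\, \dfrac{\Gamma[\tfrac12(n + 1 - j) - int]}{\Gamma[\tfrac12(n + 1 - j)]}.$

Thus $-2\log\lambda_1^*$ is distributed as the sum of $p$ independent
variates, the characteristic function of the $j$th being (17). Using
Stirling's approximation for the gamma function, we have

(18)   $\phi_j(t) \sim \left(\dfrac{2e}{n}\right)^{-int} (1 - 2it)^{-\frac12(n - 2int)}\,
        \dfrac{e^{-[\frac12(n+1-j) - int]}\, [\tfrac12(n+1-j) - int]^{\frac12(n-j) - int}}{e^{-\frac12(n+1-j)}\, [\tfrac12(n+1-j)]^{\frac12(n-j)}}$
       $= (1 - 2it)^{-\frac12 j}\,
        \dfrac{\left[1 - \dfrac{j-1}{n(1 - 2it)}\right]^{\frac12(n-j) - int}}{\left(1 - \dfrac{j-1}{n}\right)^{\frac12(n-j)}}.$

As $n \to \infty$, $\phi_j(t) \to (1 - 2it)^{-\frac12 j}$, which is the
characteristic function of $\chi_j^2$ ($\chi^2$ with $j$ degrees of freedom).
Thus $-2\log\lambda_1^*$ is asymptotically distributed as
$\sum_{j=1}^{p} \chi_j^2$, which is $\chi^2$ with
$\sum_{j=1}^{p} j = \tfrac12 p(p+1)$ degrees of freedom. The distribution of
$\lambda_1^*$ can be further expanded [Korin (1968), Davis (1971)] as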


(19)   $\Pr\{-2\rho\log\lambda_1^* \le z\}$
       $= \Pr\{\chi_f^2 \le z\} + \dfrac{\gamma_2}{\rho^2 N^2}\left(\Pr\{\chi_{f+4}^2 \le z\} - \Pr\{\chi_f^2 \le z\}\right) + O(N^{-3}),$

where

(20)   $\rho = 1 - \dfrac{2p^2 + 3p - 1}{6N(p+1)},$

(21)   $\gamma_2 = \dfrac{p(2p^4 + 6p^3 + p^2 - 12p - 13)}{288(p+1)}.$

Nagarsenker and Pillai (1973b) found exact distributions and tabulated 5%
and 1% significance points, as did Davis and Field (1971), for $p = 2(1)10$
and $n = 6(1)30(5)50, 60, 120$. Table B.7 [due to Korin (1968)] gives some 5%
and 1% significance points of $-2\log\lambda_1^*$ for small values of $n$ and
$p = 2(1)10$.
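The constants of the expansion are simple rational functions of $p$ and $N$ and can be tabulated by a short routine. The sketch below assumes the forms $f = \tfrac12 p(p+1)$, $\rho = 1 - (2p^2 + 3p - 1)/[6N(p+1)]$, and $\gamma_2 = p(2p^4 + 6p^3 + p^2 - 12p - 13)/[288(p+1)]$; the function name `korin_constants` is hypothetical, and the $\chi^2$ probabilities themselves are left to whatever routine the reader has available.

```python
def korin_constants(N, p):
    """Return (f, rho, gamma2, weight) for the expansion of -2 rho log(lambda_1^*).

    weight = gamma2 / (rho^2 N^2) multiplies the difference of the two
    chi-square probabilities with f + 4 and f degrees of freedom.
    """
    f = p * (p + 1) // 2
    rho = 1.0 - (2 * p**2 + 3 * p - 1) / (6.0 * N * (p + 1))
    gamma2 = p * (2 * p**4 + 6 * p**3 + p**2 - 12 * p - 13) / (288.0 * (p + 1))
    weight = gamma2 / (rho**2 * N**2)
    return f, rho, gamma2, weight
```

Since $\rho \to 1$ and the weight tends to 0 as $N$ increases, the expansion reduces in the limit to the $\chi^2$ distribution with $\tfrac12 p(p+1)$ degrees of freedom.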

10.8.3. Invariant Tests 

The null hypothesis $H$: $\Sigma = I$ is invariant with respect to
transformations $X^* = QX + \nu$, where $Q$ is an orthogonal matrix. The
invariants of the sufficient statistics are the characteristic roots
$l_1, \ldots, l_p$ of $S$, and the invariants of the parameters are the
characteristic roots of $\Sigma$. Invariant tests are based on the roots of
$S$; the modified likelihood ratio criterion is one of them. Nagao (1973a)
suggested the criterion

(22)   $\tfrac12 n \operatorname{tr}(S - I)^2 = \tfrac12 n \sum_{i=1}^{p} (l_i - 1)^2.$

Under the null hypothesis this criterion has a limiting
$\chi^2$-distribution with $\tfrac12 p(p+1)$ degrees of freedom.

Roy (1957), Section 6.4, proposed a test based on the largest and smallest
characteristic roots $l_1$ and $l_p$: Reject the null hypothesis if

(23)   $l_p < l$  or  $l_1 > u,$

where

(24)   $\Pr\{l < l_p,\; l_1 < u \mid \Sigma = I\} = 1 - \varepsilon$

and $\varepsilon$ is the significance level. Clemm, Krishnaiah, and Waikar
(1973) give tables of $u = 1/l$. See also Schuurman and Waikar (1973).
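Both invariant procedures depend on $S$ only through its characteristic roots, which makes them straightforward to compute. The sketch below assumes NumPy; the function names are hypothetical, and the limits `lower` and `upper` (the constants $l$ and $u$) must be supplied from tables, since the code does not attempt to compute them.

```python
import numpy as np

def nagao_identity_statistic(S, n):
    """Return (n/2) tr(S - I)^2 in trace form and in root form."""
    S = np.asarray(S, dtype=float)
    E = S - np.eye(S.shape[0])
    roots = np.linalg.eigvalsh(S)
    return 0.5 * n * np.trace(E @ E), 0.5 * n * np.sum((roots - 1.0)**2)

def roy_rejects(S, lower, upper):
    """Roy's rule: reject if the smallest root < lower or the largest root > upper."""
    roots = np.linalg.eigvalsh(np.asarray(S, dtype=float))   # ascending order
    return bool(roots[0] < lower or roots[-1] > upper)
```

For $S = \operatorname{diag}(1, 2)$ and $n = 10$ the first function returns 5 in both forms, and Roy's rule rejects with limits $(0.5, 1.5)$ but not with limits $(0.5, 2.5)$.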


10.8.4. Confidence Bounds for Quadratic Forms

The test procedure based on the smallest and largest characteristic roots can
be inverted to give confidence bounds on quadratic forms in $\Sigma$. Suppose
$nS$ has the distribution $W(\Sigma, n)$. Let $C$ be a nonsingular matrix
such that $\Sigma = C'C$. Then $nS^* = nC'^{-1}SC^{-1}$ has the distribution
$W(I, n)$. Since $l_p^* \le a'S^*a/a'a \le l_1^*$ for all $a$, where $l_p^*$
and $l_1^*$ are the smallest and largest characteristic roots of $S^*$
(Sections 11.2 and A.2),

(25)   $\Pr\left\{ l \le \dfrac{a'S^*a}{a'a} \le u \;\; \forall a \ne 0 \right\} = 1 - \varepsilon,$

where

(26)   $\Pr\{ l \le l_p^* \le l_1^* \le u \} = 1 - \varepsilon.$

Let $a = Cb$. Then $a'a = b'C'Cb = b'\Sigma b$ and
$a'S^*a = b'C'S^*Cb = b'Sb$. Thus (25) is

(27)   $1 - \varepsilon = \Pr\left\{ l \le \dfrac{b'Sb}{b'\Sigma b} \le u \;\; \forall b \ne 0 \right\}
        = \Pr\left\{ \dfrac{b'Sb}{u} \le b'\Sigma b \le \dfrac{b'Sb}{l} \;\; \forall b \right\}.$

Given an observed $S$, one can assert

(28)   $\dfrac{b'Sb}{u} \le b'\Sigma b \le \dfrac{b'Sb}{l} \quad \forall b$

with confidence $1 - \varepsilon$.

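If $b$ has 1 in the $i$th position and 0's elsewhere, (28) is
$s_{ii}/u \le \sigma_{ii} \le s_{ii}/l$. If $b$ has 1 in the $i$th position,
$-1$ in the $j$th position, $i \ne j$, and 0's elsewhere, then (28) is

(29)   $\dfrac{s_{ii} + s_{jj} - 2s_{ij}}{u} \le \sigma_{ii} + \sigma_{jj} - 2\sigma_{ij} \le \dfrac{s_{ii} + s_{jj} - 2s_{ij}}{l}.$

Manipulation of these inequalities yields

(30)   $\dfrac{s_{ij}}{l} - \dfrac{s_{ii} + s_{jj}}{2}\left(\dfrac{1}{l} - \dfrac{1}{u}\right) \le \sigma_{ij} \le \dfrac{s_{ij}}{u} + \dfrac{s_{ii} + s_{jj}}{2}\left(\dfrac{1}{l} - \dfrac{1}{u}\right), \quad i \ne j.$

We can obtain simultaneously confidence intervals on all elements of
$\Sigma$.

From (27) we can obtain

(31)   $1 - \varepsilon = \Pr\left\{ \dfrac{1}{u} \dfrac{b'Sb}{b'b} \le \dfrac{b'\Sigma b}{b'b} \le \dfrac{1}{l} \dfrac{b'Sb}{b'b} \;\; \forall b \right\}$
       $\le \Pr\left\{ \dfrac{1}{u} \min_a \dfrac{a'Sa}{a'a} \le \dfrac{b'\Sigma b}{b'b} \le \dfrac{1}{l} \max_a \dfrac{a'Sa}{a'a} \;\; \forall b \right\}$
       $= \Pr\left\{ \dfrac{1}{u} l_p \le \lambda_p \le \lambda_1 \le \dfrac{1}{l} l_1 \right\},$

where $l_1$ and $l_p$ are the largest and smallest characteristic roots of
$S$ and $\lambda_1$ and $\lambda_p$ are the largest and smallest
characteristic roots of $\Sigma$. Then

(32)   $\dfrac{1}{u} l_p \le \lambda(\Sigma) \le \dfrac{1}{l} l_1$

is a confidence interval for all characteristic roots of $\Sigma$ with
confidence at least $1 - \varepsilon$. In Section 11.6 we give tighter bounds
on $\lambda(\Sigma)$ with exact confidence.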
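The simultaneous bounds can be assembled for every element of $\Sigma$ at once. The sketch below assumes NumPy and that the constants $l < u$ have been obtained from tables; the function name `covariance_bounds` is hypothetical. It uses the bounds $s_{ij}/l - \tfrac12(s_{ii}+s_{jj})(1/l - 1/u) \le \sigma_{ij} \le s_{ij}/u + \tfrac12(s_{ii}+s_{jj})(1/l - 1/u)$, which for $i = j$ reduce to $s_{ii}/u \le \sigma_{ii} \le s_{ii}/l$, together with the interval $[l_p/u,\; l_1/l]$ for the characteristic roots.

```python
import numpy as np

def covariance_bounds(S, l, u):
    """Return (lower, upper, root_interval) of simultaneous confidence bounds.

    lower[i, j] <= sigma_ij <= upper[i, j]; root_interval bounds every
    characteristic root of Sigma. The constants l and u are assumed given.
    """
    S = np.asarray(S, dtype=float)
    d = np.diag(S)
    half_sum = 0.5 * (d[:, None] + d[None, :])     # (s_ii + s_jj) / 2
    slack = half_sum * (1.0 / l - 1.0 / u)
    lower = S / l - slack
    upper = S / u + slack
    roots = np.linalg.eigvalsh(S)                  # ascending order
    return lower, upper, (roots[0] / u, roots[-1] / l)
```

With $S$ having rows $(2, 1)$ and $(1, 3)$, $l = 0.5$, and $u = 2$, the bounds are $1 \le \sigma_{11} \le 4$ and $-1.75 \le \sigma_{12} \le 4.25$.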
10.9. TESTING THE HYPOTHESIS THAT A MEAN VECTOR AND
A COVARIANCE MATRIX ARE EQUAL TO A GIVEN VECTOR
AND MATRIX


In Chapter 3 we pointed out that if $\Psi$ is known,
$(\bar{y} - \nu_0)'\Psi_0^{-1}(\bar{y} - \nu_0)$ is suitable for testing

(1)   $H_2$: $\nu = \nu_0$, given $\Psi = \Psi_0$.

Now let us combine $H_1$ of Section 10.8 and $H_2$, and test

(2)   $H$: $\nu = \nu_0$, $\Psi = \Psi_0$,

on the basis of a sample $y_1, \ldots, y_N$ from $N(\nu, \Psi)$.

Let

(3)   $X = C(Y - \nu_0),$

where

(4)   $C\Psi_0C' = I.$

Then $x_1, \ldots, x_N$ constitutes a sample from $N(\mu, \Sigma)$, and the
hypothesis is

(5)   $H$: $\mu = 0$, $\Sigma = I$.

The likelihood ratio criterion for $H_2$: $\mu = 0$, given $\Sigma = I$, is

(6)   $\lambda_2 = e^{-\frac12 N\bar{x}'\bar{x}}.$

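The likelihood ratio criterion for $H$ is (by Lemma 10.3.1)

(7)   $\lambda = \lambda_1\lambda_2 = \left(\dfrac{e}{N}\right)^{\frac12 pN} |A|^{\frac12 N} e^{-\frac12[\operatorname{tr} A + N\bar{x}'\bar{x}]}.$

The likelihood ratio test (rejecting $H$ if $\lambda$ is less than a suitable
constant) is unbiased [Srivastava and Khatri (1979), Theorem 10.4.5]. The two
factors $\lambda_1$ and $\lambda_2$ are independent because $\lambda_1$ is a
function of $A$ and $\lambda_2$ is a function of $\bar{x}$, and $A$ and
$\bar{x}$ are independent. Since

(8)   $\mathcal{E} \lambda_2^h = \mathcal{E} e^{-\frac12 hN\bar{x}'\bar{x}} = (1 + h)^{-\frac12 p},$

the $h$th moment of $\lambda$ is

(9)   $\mathcal{E} \lambda^h = \mathcal{E} \lambda_1^h\, \mathcal{E} \lambda_2^h
       = \left(\dfrac{2e}{N}\right)^{\frac12 pNh} \dfrac{1}{(1 + h)^{\frac12 pN(1+h)}}\, \dfrac{\Gamma_p[\tfrac12(n + Nh)]}{\Gamma_p(\tfrac12 n)}$

under the null hypothesis. Then

(10)   $-2\log\lambda = -2\log\lambda_1 - 2\log\lambda_2$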
has asymptotically the $\chi^2$-distribution with $f = p(p+1)/2 + p$ degrees
of freedom. In fact, an asymptotic expansion of the distribution [Davis
(1971)] of $-2\rho\log\lambda$ is

(11)   $\Pr\{-2\rho\log\lambda \le z\}$
       $= \Pr\{\chi_f^2 \le z\} + \dfrac{\gamma_2}{\rho^2 N^2}\left(\Pr\{\chi_{f+4}^2 \le z\} - \Pr\{\chi_f^2 \le z\}\right) + O(N^{-3}),$

where

(12)   $\rho = 1 - \dfrac{2p^2 + 9p - 11}{6N(p+3)},$

(13)   $\gamma_2 = \dfrac{p(2p^4 + 18p^3 + 49p^2 + 36p - 13)}{288(p+3)}.$

Nagarsenker and Pillai (1974) used the moments to derive exact distributions
and tabulated the 5% and 1% significance points for $p = 2(1)6$ and
$N = 4(1)20(2)40(5)100$.

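Now let us return to the observations $y_1, \ldots, y_N$. Then

(14)   $\sum_{\alpha} x_\alpha'x_\alpha = \sum_{\alpha} (y_\alpha - \nu_0)'C'C(y_\alpha - \nu_0)$
       $= \sum_{\alpha} (y_\alpha - \nu_0)'\Psi_0^{-1}(y_\alpha - \nu_0)$
       $= \operatorname{tr} A + N\bar{x}'\bar{x}$
       $= \operatorname{tr}(B\Psi_0^{-1}) + N(\bar{y} - \nu_0)'\Psi_0^{-1}(\bar{y} - \nu_0)$

and

(15)   $|A| = |B\Psi_0^{-1}|.$

Theorem 10.9.1. Given the p-component observation vectors
$y_1, \ldots, y_N$ from $N(\nu, \Psi)$, the likelihood ratio criterion for
testing the hypothesis $H$: $\nu = \nu_0$, $\Psi = \Psi_0$, is

(16)   $\lambda = \left(\dfrac{e}{N}\right)^{\frac12 pN} |B\Psi_0^{-1}|^{\frac12 N} \exp\left\{-\tfrac12\left[\operatorname{tr} B\Psi_0^{-1} + N(\bar{y} - \nu_0)'\Psi_0^{-1}(\bar{y} - \nu_0)\right]\right\}.$

When the null hypothesis is true, $-2\log\lambda$ is asymptotically
distributed as $\chi^2$ with $\tfrac12 p(p+1) + p$ degrees of freedom.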
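The criterion of Theorem 10.9.1 is most conveniently evaluated as $-2\log\lambda = pN\log N - pN - N\log|B\Psi_0^{-1}| + \operatorname{tr} B\Psi_0^{-1} + N(\bar{y} - \nu_0)'\Psi_0^{-1}(\bar{y} - \nu_0)$. The sketch below assumes NumPy and observations stored as rows; the function name `neg2_log_lambda` is hypothetical.

```python
import numpy as np

def neg2_log_lambda(Y, nu0, Psi0):
    """Return -2 log(lambda) for H: nu = nu0, Psi = Psi0 (Theorem 10.9.1)."""
    Y = np.asarray(Y, dtype=float)
    nu0 = np.asarray(nu0, dtype=float)
    Psi0 = np.asarray(Psi0, dtype=float)
    N, p = Y.shape
    ybar = Y.mean(axis=0)
    D = Y - ybar
    B = D.T @ D
    M = np.linalg.solve(Psi0, B)        # same trace and determinant as B Psi0^{-1}
    _, logdet = np.linalg.slogdet(M)
    d = ybar - nu0
    quad = N * (d @ np.linalg.solve(Psi0, d))
    return p * N * np.log(N) - p * N - N * logdet + np.trace(M) + quad
```

The value can be compared with the $\chi^2$ distribution with $\tfrac12 p(p+1) + p$ degrees of freedom; it also equals twice the difference between the maximized normal log likelihood and the log likelihood at $(\nu_0, \Psi_0)$, which gives an independent check on the algebra.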

10.10. ADMISSIBILITY OF TESTS 

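We shall consider some Bayes solutions to the problem of testing the
hypothesis

(1)   $\Sigma_1 = \cdots = \Sigma_q$

as in Section 10.2. Under the alternative hypothesis, let

(2)   $[\mu^{(g)}, \Sigma_g] = [(I + C_gC_g')^{-1}C_gy^{(g)},\; (I + C_gC_g')^{-1}], \quad g = 1, \ldots, q,$

where the $p \times r_g$ matrix $C_g$ has density proportional to
$|I + C_gC_g'|^{-\frac12 n_g}$, $n_g = N_g - 1$, the $r_g$-component random
vector $y^{(g)}$ has the conditional normal distribution with mean 0 and
covariance matrix $(1/N_g)[I_{r_g} - C_g'(I_p + C_gC_g')^{-1}C_g]^{-1}$ given
$C_g$, and $(C_1, y^{(1)}), \ldots, (C_q, y^{(q)})$ are independently
distributed. As we shall see, we need to choose suitable integers
$r_1, \ldots, r_q$. Note that the integral of $|I + C_gC_g'|^{-\frac12 n_g}$
is finite if $n_g \ge p + r_g$. Then the numerator of the Bayes ratio is

(3)   $\text{const} \prod_{g=1}^{q} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} |I + C_gC_g'|^{\frac12 N_g}$
      $\cdot \exp\left\{-\tfrac12 \sum_{\alpha=1}^{N_g} [x_\alpha^{(g)} - (I + C_gC_g')^{-1}C_gy^{(g)}]'(I + C_gC_g')[x_\alpha^{(g)} - (I + C_gC_g')^{-1}C_gy^{(g)}]\right\}$
      $\cdot\, |I + C_gC_g'|^{-\frac12 n_g}\, |I - C_g'(I + C_gC_g')^{-1}C_g|^{\frac12}$
      $\cdot \exp\left\{-\tfrac12 N_g\, y^{(g)\prime}[I - C_g'(I + C_gC_g')^{-1}C_g]\, y^{(g)}\right\} dy^{(g)}\, dC_g$

      $= \text{const} \prod_{g=1}^{q} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \exp\left\{-\tfrac12\left[\sum_{\alpha=1}^{N_g} x_\alpha^{(g)\prime}(I + C_gC_g')x_\alpha^{(g)} - 2y^{(g)\prime}C_g'\sum_{\alpha=1}^{N_g} x_\alpha^{(g)} + N_g\, y^{(g)\prime}y^{(g)}\right]\right\} dy^{(g)}\, dC_g$

      $= \text{const} \prod_{g=1}^{q} \exp\left\{-\tfrac12[\operatorname{tr} A_g + N_g\bar{x}^{(g)\prime}\bar{x}^{(g)}]\right\} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty}$
      $\cdot \exp\left\{-\tfrac12 N_g (y^{(g)} - C_g'\bar{x}^{(g)})'(y^{(g)} - C_g'\bar{x}^{(g)}) - \tfrac12 \operatorname{tr} C_g'A_gC_g\right\} dy^{(g)}\, dC_g$

      $= \text{const} \prod_{g=1}^{q} \exp\left\{-\tfrac12[\operatorname{tr} A_g + N_g\bar{x}^{(g)\prime}\bar{x}^{(g)}]\right\} |A_g|^{-\frac12 r_g}.$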

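Under the null hypothesis let

(4)   $[\mu^{(g)}, \Sigma_g] = [(I + CC')^{-1}Cy^{(g)},\; (I + CC')^{-1}],$

where the $p \times r$ matrix $C$ has density proportional to
$|I + CC'|^{-\frac12 n}$, $n = \sum_{g=1}^{q} n_g$, the $r$-component vector
$y^{(g)}$ has the conditional normal distribution with mean 0 and covariance
matrix $(1/N_g)[I_r - C'(I_p + CC')^{-1}C]^{-1}$ given $C$, and
$y^{(1)}, \ldots, y^{(q)}$ are conditionally independent. Note that the
integral of $|I + CC'|^{-\frac12 n}$ is finite if $n \ge p + r$. The
denominator of the Bayes ratio is

(5)   $\text{const} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \prod_{g=1}^{q} \Big\{ |I + CC'|^{\frac12 N_g}$
      $\cdot \exp\left\{-\tfrac12 \sum_{\alpha=1}^{N_g} [x_\alpha^{(g)} - (I + CC')^{-1}Cy^{(g)}]'(I + CC')[x_\alpha^{(g)} - (I + CC')^{-1}Cy^{(g)}]\right\}$
      $\cdot\, |I + CC'|^{-\frac12 n_g}\, |I - C'(I + CC')^{-1}C|^{\frac12}$
      $\cdot \exp\left\{-\tfrac12 N_g\, y^{(g)\prime}[I - C'(I + CC')^{-1}C]\, y^{(g)}\right\} dy^{(g)} \Big\}\, dC.$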




For invariance we want $\sum_{g=1}^{q} r_g = r$.

The binding constraint on the choice of $r_1, \ldots, r_q$ is
$r_g \le n_g - p$, $g = 1, \ldots, q$. It is possible in some special cases
to choose $r_1, \ldots, r_q$ so that $(r_1, \ldots, r_q)$ is proportional to
$(N_1, \ldots, N_q)$ and hence yield the likelihood ratio test or
proportional to $(n_1, \ldots, n_q)$ and hence yield the modified likelihood
ratio test, but since $r_1, \ldots, r_q$ have to be integers, it may not be
possible to choose them in either such way. Next we consider an extension of
this approach that involves the choice of numbers $t_1, \ldots, t_q$, and
$t$ as well as $r_1, \ldots, r_q$, and $r$.

Suppose $2(p - 1) < n_g$, $g = 1, \ldots, q$, and take $r_g \ge p$. Let
$t_g$ be a real number such that $2p - 1 < r_g + t_g + p < n_g + 1$ and let
$t$ be a real number such that $2p - 1 < r + t + p < n + 1$. Under the
alternative hypothesis let the marginal density of $C_g$ be proportional to
$|C_gC_g'|^{\frac12 t_g} |I + C_gC_g'|^{-\frac12 n_g}$, $g = 1, \ldots, q$,
and under the null hypothesis let the marginal density of $C$ be proportional
to $|CC'|^{\frac12 t} |I + CC'|^{-\frac12 n}$. (The conditions on
$r_1, \ldots, r_q$, $t_1, \ldots, t_q$, and $t$ ensure that the purported
densities have finite integrals; see Problem 10.18.) Then the Bayes procedure
is to reject the null hypothesis if

(7)   $\dfrac{|A|^{\frac12(r + t)}}{\prod_{g=1}^{q} |A_g|^{\frac12(r_g + t_g)}} \ge c.$

For invariance we want $t = \sum_{g=1}^{q} t_g$. If $t_1, \ldots, t_q$ are
taken so $r_g + t_g = kN_g$ and $p - 1 < kN_g < N_g - p$, $g = 1, \ldots, q$,
for some $k$, then (7) is the likelihood ratio test; if $r_g + t_g = kn_g$
and $p - 1 < kn_g < n_g + 1 - p$, $g = 1, \ldots, q$, for some $k$, then (7)
is the modified test [i.e.,
$(p - 1)/\min_g N_g < k < 1 - p/\min_g N_g$].

Theorem 10.10.1. If $2p < N_g + 1$, $g = 1, \ldots, q$, then the likelihood
ratio test and the modified likelihood ratio test of the null hypothesis (1)
are admissible.

Now consider the hypothesis

(8)   $\mu^{(1)} = \cdots = \mu^{(q)}, \quad \Sigma_1 = \cdots = \Sigma_q.$

The alternative hypothesis has been treated before. For the null hypothesis
let

(9)   $[\mu^{(g)}, \Sigma_g] = [(I + CC')^{-1}Cy,\; (I + CC')^{-1}],$

where the $p \times r$ matrix $C$ has the density proportional to
$|I + CC'|^{-\frac12(N-1)}$ and the $r$-component vector $y$ has the
conditional normal distribution with mean 0 and covariance matrix
$(1/N)[I - C'(I + CC')^{-1}C]^{-1}$ given $C$. Then the Bayes procedure is to
reject the null hypothesis (8) if

(10)   $\dfrac{\left|\sum_{g=1}^{q} A_g + \sum_{g=1}^{q} N_g(\bar{x}^{(g)} - \bar{x})(\bar{x}^{(g)} - \bar{x})'\right|^{\frac12 r}}{\prod_{g=1}^{q} |A_g|^{\frac12 r_g}} \ge c.$

If $2p < N_g + 1$, $g = 1, \ldots, q$, the prior distribution can be
modified as before to obtain the likelihood ratio test and modified
likelihood ratio test.

Theorem 10.10.2. If $2p < N_g + 1$, $g = 1, \ldots, q$, the likelihood ratio
test and modified likelihood ratio test of the null hypothesis (8) are
admissible.

For more details see Kiefer and Schwartz (1965).


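10.11. ELLIPTICALLY CONTOURED DISTRIBUTIONS
10.11.1. Observations Elliptically Contoured

Let $x_\alpha^{(g)}$, $\alpha = 1, \ldots, N_g$, be $N_g$ observations on
$X^{(g)}$ having the density

(1)   $|\Lambda_g|^{-\frac12}\, g[(x - \nu^{(g)})'\Lambda_g^{-1}(x - \nu^{(g)})],$

where $\mathcal{E}[(X - \nu^{(g)})'\Lambda_g^{-1}(X - \nu^{(g)})]^2 = \mathcal{E} R^4 < \infty$,
$g = 1, \ldots, q$. Note that the same function $g(\cdot)$ is used for the
density in all $q$ populations. Define $N$, $\bar{x}^{(g)}$, $A_g$,
$g = 1, \ldots, q$, and $A$ by (1) of Section 10.2. Let
$S_g = (1/n_g)A_g$, where $n_g = N_g - 1$, and $S = (1/n)A$, where
$n = \sum_{g=1}^{q} n_g$.

Since the likelihood ratio criterion $\lambda_1$ is invariant under the
transformation $X^{*(g)} = CX^{(g)} + \nu^{(g)}$, under the null hypothesis we
can take $\Sigma_1 = \cdots = \Sigma_q = I$ and
$\nu^{(1)} = \cdots = \nu^{(q)} = 0$. Then

(2)   $-2\log\lambda_1 = -\left[\sum_{g=1}^{q} N_g \log|\hat{\Sigma}_{g\Omega}| - N\log|\hat{\Sigma}_\omega|\right]$
      $= -\left[\sum_{g=1}^{q} N_g \log|I + (\hat{\Sigma}_{g\Omega} - I)| - N\log|I + (\hat{\Sigma}_\omega - I)|\right]$
      $= -\sum_{g=1}^{q} N_g\left[\operatorname{tr}(\hat{\Sigma}_{g\Omega} - I) - \tfrac12 \operatorname{tr}(\hat{\Sigma}_{g\Omega} - I)^2 + O_p(N_g^{-3/2})\right]$
      $\quad + N\left[\operatorname{tr}(\hat{\Sigma}_\omega - I) - \tfrac12 \operatorname{tr}(\hat{\Sigma}_\omega - I)^2 + O_p(N^{-3/2})\right]$
      $= \tfrac12\left[\sum_{g=1}^{q} N_g \operatorname{tr}(\hat{\Sigma}_{g\Omega} - I)^2 - N\operatorname{tr}(\hat{\Sigma}_\omega - I)^2\right] + O_p(N^{-1/2})$
      $= \tfrac12\left[\sum_{g=1}^{q} N_g [\operatorname{vec}(\hat{\Sigma}_{g\Omega} - I)]'\operatorname{vec}(\hat{\Sigma}_{g\Omega} - I) - N[\operatorname{vec}(\hat{\Sigma}_\omega - I)]'\operatorname{vec}(\hat{\Sigma}_\omega - I)\right] + O_p(N^{-1/2}).$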

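By Theorem 3.6.2

(3)   $\sqrt{N_g}\, \operatorname{vec}(S_g - I_p) \xrightarrow{d} N[0,\; (\kappa + 1)(I_{p^2} + K_{pp}) + \kappa \operatorname{vec} I_p (\operatorname{vec} I_p)'],$

and $N_g\hat{\Sigma}_{g\Omega}$, $g = 1, \ldots, q$, are independent. Let
$N_g = k_gN$, $g = 1, \ldots, q$, $\sum_{g=1}^{q} k_g = 1$, and let
$N \to \infty$. In terms of this asymptotic theory the limiting distribution
of $\operatorname{vec}(S_1 - I), \ldots, \operatorname{vec}(S_q - I)$ is the
same as the distribution of $y^{(1)}, \ldots, y^{(q)}$ of Section 8.8, with
$\Sigma$ of Section 8.8 replaced by
$(\kappa + 1)(I_{p^2} + K_{pp}) + \kappa \operatorname{vec} I_p (\operatorname{vec} I_p)'$.

When $\Sigma = I$, the variance of the limiting distribution of
$\sqrt{N_g}(s_{ii}^{(g)} - 1)$ is $3\kappa + 2$; the covariance of the
limiting distribution of $\sqrt{N_g}(s_{ii}^{(g)} - 1)$ and
$\sqrt{N_g}(s_{jj}^{(g)} - 1)$, $i \ne j$, is $\kappa$; the variance of
$\sqrt{N_g}\, s_{ij}^{(g)}$, $i \ne j$, is $\kappa + 1$; the set
$\sqrt{N_g}(s_{11}^{(g)} - 1), \ldots, \sqrt{N_g}(s_{pp}^{(g)} - 1)$ is
independent of the set $(s_{ij}^{(g)})$, $i \ne j$; and the $s_{ij}^{(g)}$,
$i < j$, are mutually uncorrelated (as in Section 7.9.1).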
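Let $y_g = \operatorname{vec}(\hat{\Sigma}_{g\Omega} - I)$ and
$y = \operatorname{vec}(\hat{\Sigma}_\omega - I)$. Then
$y = \sum_{g=1}^{q} (N_g/N)y_g$ and

(4)   $-2\log\lambda_1 = \tfrac12\left(\sum_{g=1}^{q} N_g y_g'y_g - Ny'y\right) + O_p(N^{-1/2})$
      $= \tfrac12 \operatorname{tr} \sum_{g=1}^{q} N_g (y_g - y)(y_g - y)' + O_p(N^{-1/2})$
      $= \tfrac12 \operatorname{tr}\left(\sum_{g=1}^{q} N_g y_gy_g' - Nyy'\right) + O_p(N^{-1/2}).$

Let $Q$ be a $q \times q$ orthogonal matrix with last column
$(\sqrt{N_1/N}, \ldots, \sqrt{N_q/N})'$. Define

(5)   $(w_1, \ldots, w_q) = (\sqrt{N_1}\, y_1, \ldots, \sqrt{N_q}\, y_q)\, Q.$

Then $w_q = \sqrt{N}\, y$ and

(6)   $\sum_{g=1}^{q} N_g y_gy_g' - Nyy' = \sum_{g=1}^{q-1} w_gw_g'.$

In these terms

(7)   $-2\log\lambda_1 = \tfrac12 \sum_{g=1}^{q-1} w_g'w_g + O_p(N^{-1/2}),$

and $w_1, \ldots, w_{q-1}$ are asymptotically independent, $w_g$ having the
covariance matrix of $\sqrt{N_g}\, y_g$; that is,
$(\kappa + 1)(I_{p^2} + K_{pp}) + \kappa \operatorname{vec} I_p (\operatorname{vec} I_p)'$.
Then
$w_g'w_g = \sum_{i,j=1}^{p} (w_{ij}^{(g)})^2 = \sum_{i=1}^{p} (w_{ii}^{(g)})^2 + 2\sum_{i<j} (w_{ij}^{(g)})^2$.
The covariance matrix of $w_{11}^{(g)}, \ldots, w_{pp}^{(g)}$ is
$2(\kappa + 1)I_p + \kappa\varepsilon\varepsilon'$, where
$\varepsilon = (1, \ldots, 1)'$. The characteristic roots of this matrix are
$2(\kappa + 1)$ of multiplicity $p - 1$ and a single root of
$2(\kappa + 1) + p\kappa$. Thus $\sum_{i=1}^{p} (w_{ii}^{(g)})^2$ has the
distribution of $2(\kappa + 1)\chi_{p-1}^2 + [2(\kappa + 1) + p\kappa]\chi_1^2$.
The distribution of $2\sum_{i<j} (w_{ij}^{(g)})^2$ is
$2(\kappa + 1)\chi_{p(p-1)/2}^2$.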

Theorem 10.11.1. When sampling from (1) and the null hypothesis is true,

(8)   $-2\log\lambda_1 \xrightarrow{d} (\kappa + 1)\chi^2_{(q-1)(p-1)(p+2)/2} + [(\kappa + 1) + p\kappa/2]\,\chi^2_{q-1}.$

When $\kappa = 0$, $-2\log\lambda_1 \xrightarrow{d} \chi^2_{(q-1)p(p+1)/2}$,
in agreement with (12) of Section 10.5. The validity of the distributions
derived in Section 10.4 depends on the observations being normally
distributed; Theorem 10.11.1 shows that even the asymptotic theory depends on
nonnormality.
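The effect of the kurtosis parameter on the limiting distribution can be made concrete by listing the weights and degrees of freedom of the two independent $\chi^2$ components, $(\kappa + 1)\chi^2_{(q-1)(p-1)(p+2)/2}$ and $[(\kappa + 1) + p\kappa/2]\chi^2_{q-1}$. The sketch below is illustrative only and the function names are hypothetical; it also returns the mean of the limiting distribution, which for $\kappa = 0$ equals the degrees of freedom $\tfrac12(q-1)p(p+1)$ of the normal theory.

```python
def limiting_mixture(p, q, kappa):
    """Return [(weight, degrees of freedom), ...] for the limit of -2 log(lambda_1)."""
    df_first = (q - 1) * (p - 1) * (p + 2) // 2     # (p-1)(p+2) is always even
    return [(kappa + 1.0, df_first),
            (kappa + 1.0 + 0.5 * p * kappa, q - 1)]

def limiting_mean(p, q, kappa):
    """Mean of the limiting distribution: sum of weight times degrees of freedom."""
    return sum(w * df for w, df in limiting_mixture(p, q, kappa))
```

With $p = 2$, $q = 3$, and $\kappa = 0.5$ the components are $1.5\chi_4^2$ and $2\chi_2^2$, with mean 10, compared with 6 under normality.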
The likelihood ratio criterion for testing the null hypothesis (2) of Section
10.3 is the product $\lambda_1\lambda_2$ or $V_1V_2$. Lemma 10.4.1 states that
under normality $V_1$ and $V_2$ (or equivalently $\lambda_1$ and
$\lambda_2$) are independent. In the elliptically contoured case we want to
show that $\log V_1$ and $\log V_2$ are asymptotically independent.

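Lemma 10.11.1. Let $A_1 = n_1S_1$ and $A_2 = n_2S_2$ be defined by (2) of
Section 10.2 with $\Sigma_1 = \Sigma_2 = I$. Then $A_1(A_1 + A_2)^{-1}$ and
$A_1 + A_2$ are asymptotically independent.

Proof. Let $W_g = (1/\sqrt{n_g})(A_g - n_gI)$, $g = 1, 2$. Then

(9)   $\sqrt{n_1 + n_2}\left[A_1(A_1 + A_2)^{-1} - \dfrac{n_1}{n_1 + n_2} I\right]
       = \sqrt{\dfrac{n_1n_2}{(n_1 + n_2)^2}} \left(\sqrt{\dfrac{n_2}{n_1 + n_2}}\, W_1 - \sqrt{\dfrac{n_1}{n_1 + n_2}}\, W_2\right) + o_p(1),$

(10)   $\dfrac{1}{\sqrt{n_1 + n_2}}\left[A_1 + A_2 - (n_1 + n_2)I\right]
        = \sqrt{\dfrac{n_1}{n_1 + n_2}}\, W_1 + \sqrt{\dfrac{n_2}{n_1 + n_2}}\, W_2 + o_p(1).$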

Then 



By application of Lemma 10.11.1 in succession to $A_1$ and $A_1 + A_2$, to
$A_1 + A_2$ and $A_1 + A_2 + A_3$, etc., we establish that
$A_1A^{-1}, A_2A^{-1}, \ldots, A_qA^{-1}$ are independent of
$A = A_1 + \cdots + A_q$. It follows that $V_1$ and $V_2$ are asymptotically
independent.

Theorem 10.11.2. When $\Sigma_1 = \cdots = \Sigma_q$ and
$\mu^{(1)} = \cdots = \mu^{(q)}$,

(12)   $-2\log\lambda_1\lambda_2 = -2\log\lambda_1 - 2\log\lambda_2$
       $\xrightarrow{d} (\kappa + 1)\chi^2_{(q-1)(p-1)(p+2)/2} + [(\kappa + 1) + p\kappa/2]\,\chi^2_{q-1} + \chi^2_{p(q-1)}.$
The hypothesis of sphericity is that $\Sigma = \sigma^2 I$ (or
$\Lambda = \lambda I$). The criterion is $\lambda_1\lambda_2$, where

(13)   $\lambda_1 = \dfrac{|A|^{\frac12 N}}{\prod_{i=1}^{p} a_{ii}^{\frac12 N}}, \qquad
        \lambda_2 = \dfrac{\prod_{i=1}^{p} a_{ii}^{\frac12 N}}{(\operatorname{tr} A/p)^{\frac12 pN}}.$

The first factor is the criterion for independence of the components of $X$,
and the second is that the variances of the components are equal. For the
first we set $q = p$ and $p_i = 1$ in Theorem 9.10, and for the second we set
$q = p$ and $p = 1$. Thus

(14)   $-2\log(\lambda_1\lambda_2) \xrightarrow{d} (1 + \kappa)\chi^2_{p(p-1)/2} + \tfrac12(3\kappa + 2)\chi^2_{p-1}.$


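10.11.2. Elliptically Contoured Matrix Distributions
Consider the density

(15)   $\prod_{g=1}^{q} |\Lambda_g|^{-N_g/2}\; g\left[\sum_{g=1}^{q} \operatorname{tr} \Lambda_g^{-1}(X^{(g)} - \nu^{(g)}\varepsilon_{N_g}')(X^{(g)} - \nu^{(g)}\varepsilon_{N_g}')'\right]$
       $= \prod_{g=1}^{q} |\Lambda_g|^{-N_g/2}\; g\left[\operatorname{tr} \sum_{g=1}^{q} \Lambda_g^{-1}A_g + \sum_{g=1}^{q} N_g(\bar{x}^{(g)} - \nu^{(g)})'\Lambda_g^{-1}(\bar{x}^{(g)} - \nu^{(g)})\right].$

In this density $(A_g, \bar{x}^{(g)})$, $g = 1, \ldots, q$, is a sufficient
set of statistics, and the likelihood ratio criterion is (8) of Section 10.2,
the same as for normality [Anderson and Fang (1990b)].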

Theorem 10.11.3. Let $f(X)$ be a vector-valued function of
$X = (X^{(1)}, \ldots, X^{(q)})$ $(p \times N)$ such that

(16)   $f(X^{(1)} + \nu^{(1)}\varepsilon_{N_1}', \ldots, X^{(q)} + \nu^{(q)}\varepsilon_{N_q}') = f(X^{(1)}, \ldots, X^{(q)})$

for every $(\nu^{(1)}, \ldots, \nu^{(q)})$ and

(17)   $f(CX^{(1)}, \ldots, CX^{(q)}) = f(X^{(1)}, \ldots, X^{(q)})$

for every nonsingular $C$. Then the distribution of $f(X)$ where $X$ has the
arbitrary density (15) with $\Lambda_1 = \cdots = \Lambda_q$ is the same as
the distribution of $f(X)$ where $X$ has the normal density (15).
The proof of Theorem 10.11.3 is similar to the proof of Theorem 4.5.4.
The theorem implies that the distribution of the criterion of (10) of
Section 10.2 when the density of $X$ is (15) with
$\Lambda_1 = \cdots = \Lambda_q$ is the same as for normality. Hence the
distributions and their asymptotic expansions are those discussed in Sections
10.4 and 10.5.

Corollary 10.11.1. Let $f(X)$ be a vector-valued function of $X$
$(p \times N)$ such that

(18)   $f(X + \nu\varepsilon_N') = f(X)$

for every $\nu$ and (17) holds. Then the distribution of $f(X)$, where $X$
has the arbitrary density (15) with $\Lambda_1 = \cdots = \Lambda_q$ and
$\nu^{(1)} = \cdots = \nu^{(q)}$, is the same as the distribution of $f(X)$,
where $X$ has the normal density (15).

It follows that the distribution of the criterion $\lambda$ of (7) or $V$ of
(11) of Section 10.3 is the same for the density (15) as for $X$ being
normally distributed.

Let $X$ $(p \times N)$ have the density

(19)   $|\Lambda|^{-N/2}\, g[\operatorname{tr} \Lambda^{-1}(X - \nu\varepsilon_N')(X - \nu\varepsilon_N')'].$

Then the likelihood ratio criterion for testing the null hypothesis
$\Lambda = \lambda I$ for some $\lambda > 0$ is (7) of Section 10.7, and its
distribution under the null hypothesis is the same as for $X$ being normally
distributed.

For more detail see Anderson and Fang (1990b) and Fang and Zhang
(1990).

PROBLEMS 

10.1. (Sec. 10.2) Sums of squares and cross-products of deviations from the
means of four measurements are given below (from Table 3.4). The populations
are Iris versicolor (1), Iris setosa (2), and Iris virginica (3); each sample
consists of 50 observations:

$A_1 = \begin{pmatrix}
13.0552 & 4.1740 & 8.9620 & 2.7332 \\
4.1740 & 4.8250 & 4.0500 & 2.0190 \\
8.9620 & 4.0500 & 10.8200 & 3.5820 \\
2.7332 & 2.0190 & 3.5820 & 1.9162
\end{pmatrix},$

$A_2 = \begin{pmatrix}
6.0882 & 4.8616 & 0.8014 & 0.5062 \\
4.8616 & 7.0408 & 0.5732 & 0.4556 \\
0.8014 & 0.5732 & 1.4778 & 0.2974 \\
0.5062 & 0.4556 & 0.2974 & 0.5442
\end{pmatrix},$

$A_3 = \begin{pmatrix}
19.8128 & 4.5944 & 14.8612 & 2.4056 \\
4.5944 & 5.0962 & 3.4976 & 2.3338 \\
14.8612 & 3.4976 & 14.9248 & 2.3924 \\
2.4056 & 2.3338 & 2.3924 & 3.6962
\end{pmatrix}.$





(a) Test the hypothesis $\Sigma_1 = \Sigma_2$ at the 5% significance level.

(b) Test the hypothesis $\Sigma_1 = \Sigma_2 = \Sigma_3$ at the 5% significance level.

10.2. (Sec. 10.2) 


(a) Let g = 1 ， ... ，分 ， be a set of random vectors each with p components* 
Suppose 


Let C be an orthogonal matrix of order q such that each element of the 
last row is 

c q h=^/>R - 

Define 

z(s)= E s= i. . 

h «= 1 

Show that 

if and only if 

= 2 2 = … = 

(b) Let X^ g \ a= 1,..., A/, be a random sample from /V(pji ⑻， 

Use the result from (a) to construct a test of the hypothesis 

based on a test of independence of 2 ⑼ and the set Z (9_l) . Find 

the exact distribution of the criterion for the case p = 2. 

103. (Sec. 10.2) Unbiasedness of the modified likelihood ratio test of cr, 2 - o^ 2 . Show 
that (14) is unbiased. [Hint: Let G =n'F/n 7 , r 】 (t}/<t^ and c { < c 2 be the 
solutions to G^ f，1 (1+G) _ ^ (… 十 " 2J = 众 ， the critical value for the modified 
likelihood ratio criterion. Then 


Pi*{AcceptanceI ctj 2 /<j 2 2 = /■} = const j (1 +rG)” +/l2) dG 

=const 广 响 广 l (l +H)~ llln,+n2) dH. 

Show that the derivative of the above with respect to r is positive for 0 < r < 1 ， 
0 for r = 1 7 and negative for r> 1.] 
10.4. (Sec. 10.2) Prove that the limiting distribution of (19) is $\chi_f^2$,
where $f = \tfrac12 p(p+1)(q-1)$. [Hint: Let $\Sigma = I$. Show that the
limiting distribution of (19) is the limiting distribution of
(19) [s the limiting distribution of 

\ E E n g {s\p~s „) 2 ^ E E 〜 ) 2 ， 

* 1 i<j g=<l 

where S ⑻ = ( 咐 ))，S = ( 5 f/ ), and the - S ( jX i j\ are independent in 

the limiting distribution, the limiting distribution of Jn^(s\p - 1) is M0,2 )， 
and the limiting distribution of \ ^ <IS /V(0 ， 1).] 

10.5. (Sec. 10.4) Prove (15) by integration of Wishart densities. [Hint: £V^ = 

忒―知 can be written as the integral of a constant times 
Ml ~ 知 +hn K \ Integration over gives a constant 

times 

10,6» (Sec. 10.4) Prove (16) by integration of Wishart and normal densities. [Hint ： 
L q gm] N g {x itl) -x)(x^ -x) r is distributed as 只 : 卜,外 Use the hint of Prob¬ 
lem 10.5.] 

10.7. (Sea 10.6) Let x、*’' …， be observations from M|Jt ⑺， 2J ， v— 1,2, and 
let A v =Ux^ ] -x {v) )\ 

(a) Prove that the likelihood ratio test for //:2!=2 2 is equivalent to 
rejecting H if 

1 雜 U c 

\A { + 破 

(b) Let d}, be the roots of ! 2,- 入 S 2 | = 0, and let 



Show that T is distributed as '\B 2 {/ \B { +B 2 | 2 , where is dis¬ 
tributed according to W(D 2 ^ N — \) and B 2 is distributed according to 
W(I, N - 1)* Show that T is distributed as \ DC { D\ * | C ? J /\DC { D 4 - C 2 | 2 , 
where C, is distributed according to W{I,N ~ 1). 

10.8. (Sec. 10.6) For p = 2 show 

+ B — ’ （ n 1 - l,n 2 - l)u(w 2 )/ n f b x- 2n ^ /n (l - Xi y n,/ " dx^ 

J a 

+ 1 -4(«! — 1，《 2 _ 1)， 
where a <b are the two roots of jc^(1 -x { y i2 — v /n r \ [Him: This 

follows from integrating the density defined by (8)J 

10,9. (Sea 10.6) For p — 2 an<l — say, show 

Pr{K, <i；} 

I _i_ \/1 — 

= 一 1， m — 1) + 2B~ 1 (m — 1 ， m — I/rr,> los. - -y. ^ — 

^ I - VI 


where a = 士 [1 - ^1 - 4y I/m ]. 

10.10. (Sec. 10.7) Find the distribution of $W$ for $p = 2$ under the null
hypothesis (a) directly from the distribution of $A$ and (b) from the
distribution of the characteristic roots (Chapter 13).

10.11. (Sec. 10.7) Let $x_1, \ldots, x_N$ be a sample from $N(\mu, \Sigma)$.
What is the likelihood ratio criterion for testing the hypothesis
$\mu = k\mu_0$, $\Sigma = k^2\Sigma_0$, where $\mu_0$ and $\Sigma_0$ are
specified and $k$ is unspecified?

10.12. (Sec. 10.7) Let $x_1^{(1)}, \ldots, x_{N_1}^{(1)}$ be a sample from
$N(\mu^{(1)}, \Sigma_1)$ and $x_1^{(2)}, \ldots, x_{N_2}^{(2)}$ be a sample
from $N(\mu^{(2)}, \Sigma_2)$. What is the likelihood ratio criterion for
testing the hypothesis that $\Sigma_1 = k^2\Sigma_2$, where $k$ is
unspecified? What is the likelihood ratio criterion for testing the
hypothesis that $\mu^{(1)} = k\mu^{(2)}$ and $\Sigma_1 = k^2\Sigma_2$, where
$k$ is unspecified?

10.13. (Sec. 10.7) Let $x_\alpha$ of $p$ components, $\alpha = 1, \ldots, N$,
be observations from $N(\mu, \Sigma)$. We define the following hypotheses:

$H$: $\mu = 0$, $\Sigma = k^2\Sigma_0$,

$H_1$: $\Sigma = k^2\Sigma_0$,

$H_2$: $\mu = 0$, given that $\Sigma = k^2\Sigma_0$.

In each case $k^2$ is unspecified, but $\Sigma_0$ is specified. Find the
likelihood ratio criterion $\lambda_2$ for testing $H_2$. Give the asymptotic
distribution of $-2\log\lambda_2$ under $H_2$. Obtain the exact distribution
of a suitable monotonic function of $\lambda_2$ under $H_2$.

10.14. (Sec. 10.7) Find the likelihood ratio criterion $\lambda$ for testing
$H$ of Problem 10.13 (given $x_1, \ldots, x_N$). What is the asymptotic
distribution of $-2\log\lambda$ under $H$?

10.15. (Sec. 10.7) Show that $\lambda = \lambda_1\lambda_2$, where $\lambda$
is defined in Problem 10.14, $\lambda_2$ is defined in Problem 10.13, and
$\lambda_1$ is the likelihood ratio criterion for $H_1$ in Problem 10.13. Are
$\lambda_1$ and $\lambda_2$ independently distributed under $H$? Prove your
answer.

10.16. (Sec. 10.7) Verify that $\operatorname{tr} B\Psi_0^{-1}/\sigma^2$ has
the $\chi^2$-distribution with $p(N - 1)$ degrees of freedom.

10.17. (Sec. 10.7.1) Admissibility of sphericity test. Prove that the likelihood ratio test
of sphericity is admissible. [Hint: Under the null hypothesis let $\Sigma = [1/(1 + \eta^2)]I$,
and let $\eta$ have the density (1 + rfY 去"#(” 2 )广

10.18. (Sec. 10.10.1) Show that for r > p

/ … / /+ n^ <c ° 

if 2p — l</ + r+ /?<rt + L [ Him: \A\ /\I +A\ <1 if A is positive semidefi- 
nhe. Also, has the distribution of Xr m x}-\ *** X^ P +i if 工 

are independently distributed according to M0, /).] 

10.19. (Sec. 10.10.1) Show

厂…厂 ICCl^e'^^ '^'dC-constUr 

where C is $p \times r$. [Hint: $CC'$ has the distribution $W(A^{-1}, r)$ if C has a density
proportional to $e^{-\frac{1}{2}\operatorname{tr} C'AC}$.]

10.20. (Sec. 10.10.1) Using Problem 10.18, complete the proof of Theorem 10.10.1.


Principal Components


11.1. INTRODUCTION

Principal components are linear combinations of random or statistical variables
which have special properties in terms of variances. For example, the
first principal component is the normalized linear combination (the sum of
squares of the coefficients being one) with maximum variance. In effect,
transforming the original vector variable to the vector of principal components
amounts to a rotation of coordinate axes to a new coordinate system
that has inherent statistical properties. This choosing of a coordinate system
is to be contrasted with the many problems treated previously where the
coordinate system is irrelevant.

The principal components turn out to be the characteristic vectors of the 
covariance matrix. Thus the study of principal components can be considered 
as putting into statistical terms the usual developments of characteristic roots 
and vectors (for positive semidefinite matrices). 

From the point of view of statistical theory, the set of principal components
yields a convenient set of coordinates, and the accompanying variances
of the components characterize their statistical properties. In statistical
practice, the method of principal components is used to find the linear
combinations with large variance. In many exploratory studies the number of
variables under consideration is too large to handle. Since it is the deviations
in these studies that are of interest, a way of reducing the number of
variables to be treated is to discard the linear combinations which have small
variances and study only those with large variances. For example, a physical
anthropologist may make dozens of measurements of lengths and breadths of
each of a number of individuals, such measurements as ear length, ear
breadth, facial length, facial breadth, and so forth. He may be interested in 
describing and analyzing how individuals differ in these kinds of physiological 
characteristics. Eventually he will want to explain these differences, but first 
he wants to know what measurements or combinations of measurements 
show considerable variation; that is, which should have further study. The 
principal components give a new set of linearly combined measurements. It 
may be that most of the variation from individual to individual resides 
in three linear combinations; then the anthropologist can direct his study to 
these three quantities; the other linear combinations vary so little from one 
person to the next that study of them will tell little of individual variation. 

Hotelling (1933), who developed many of these ideas, gave a rather
thorough discussion.

In Section 11.2 we define principal components in the population to have 
the properties described above; they define an orthogonal transformation to 
a diagonal covariance matrix. The maximum likelihood estimators have 
similar properties in the sample (Section 11.3). A brief discussion of computation
is given in Section 11.4, and a numerical example is carried out in
Section 11.5. Asymptotic distributions of the coefficients of the sample
principal components and the sample variances are derived and applied to
obtain large-sample tests and confidence intervals for individual parameters
(Section 11.6); exact confidence bounds are found for the characteristic roots 
of a covariance matrix. In Section 11.7 we consider other tests of hypotheses 
about these roots. 


11.2. DEFINITION OF PRINCIPAL COMPONENTS 
IN THE POPULATION 

Suppose the random vector X of p components has the covariance matrix $\Sigma$.
Since we shall be interested only in variances and covariances in this chapter,
we shall assume that the mean vector is 0. Moreover, in developing the ideas
and algebra here, the actual distribution of X is irrelevant except for the
covariance matrix; however, if X is normally distributed, more meaning can
be given to the principal components.

In the following treatment we shall not use the usual theory of characteristic
roots and vectors; as a matter of fact, that theory will be derived implicitly.
The treatment will include the cases where $\Sigma$ is singular (i.e., positive
semidefinite) and where $\Sigma$ has multiple roots.

Let $\beta$ be a p-component column vector such that $\beta'\beta = 1$. The variance of
$\beta'X$ is

(1)   $\mathcal{E}(\beta'X)^2 = \mathcal{E}\beta'XX'\beta = \beta'\Sigma\beta.$

To determine the normalized linear combination $\beta'X$ with maximum variance,
we must find a vector $\beta$ satisfying $\beta'\beta = 1$ which maximizes (1). Let

(2)   $\phi = \beta'\Sigma\beta - \lambda(\beta'\beta - 1) = \sum_{i,j}\beta_i\sigma_{ij}\beta_j - \lambda\left(\sum_i \beta_i^2 - 1\right),$

where $\lambda$ is a Lagrange multiplier. The vector of partial derivatives $(\partial\phi/\partial\beta_i)$
is

(3)   $\frac{\partial\phi}{\partial\beta} = 2\Sigma\beta - 2\lambda\beta$

(by Theorem A.4.3 of the Appendix). Since $\beta'\Sigma\beta$ and $\beta'\beta$ have derivatives
everywhere in a region containing $\beta'\beta = 1$, a vector $\beta$ maximizing $\beta'\Sigma\beta$
must satisfy the expression (3) set equal to 0; that is,

(4)   $(\Sigma - \lambda I)\beta = 0.$

In order to get a solution of (4) with $\beta'\beta = 1$ we must have $\Sigma - \lambda I$ singular;
in other words, $\lambda$ must satisfy

(5)   $|\Sigma - \lambda I| = 0.$

The function $|\Sigma - \lambda I|$ is a polynomial in $\lambda$ of degree p. Therefore (5) has p
roots; let these be $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$. [$\beta'$ complex conjugate in (6) proves $\lambda$
real.] If we multiply (4) on the left by $\beta'$, we obtain

(6)   $\beta'\Sigma\beta = \lambda\beta'\beta = \lambda.$

This shows that if $\beta$ satisfies (4) (and $\beta'\beta = 1$), then the variance of $\beta'X$
[given by (1)] is $\lambda$. Thus for the maximum variance we should use in (4) the
largest root $\lambda_1$. Let $\beta^{(1)}$ be a normalized solution of $(\Sigma - \lambda_1 I)\beta = 0$. Then
$U_1 = \beta^{(1)\prime}X$ is a normalized linear combination with maximum variance. [If
$\Sigma - \lambda_1 I$ is of rank p - 1, then there is only one solution to $(\Sigma - \lambda_1 I)\beta = 0$
and $\beta'\beta = 1$.]
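
As a small numerical illustration (not taken from the text), consider a hypothetical 2 x 2 covariance matrix with rows (2, 1) and (1, 2), whose characteristic roots are 3 and 1. The sketch below scans unit vectors $\beta = (\cos t, \sin t)'$ and checks that the largest variance $\beta'\Sigma\beta$ is the largest root, attained near $(1, 1)'/\sqrt{2}$. The matrix, the grid size, and the helper name `variance` are assumptions made only for this example.

```python
import math

# Hypothetical 2x2 covariance matrix (an assumption for illustration); its roots are 3 and 1.
SIGMA = [[2.0, 1.0],
         [1.0, 2.0]]

def variance(beta, sigma):
    """Return beta' Sigma beta, the variance of the linear combination beta'X."""
    p = len(beta)
    return sum(beta[i] * sigma[i][j] * beta[j] for i in range(p) for j in range(p))

# Scan unit vectors beta = (cos t, sin t)' over half a turn;
# beta and -beta give the same variance, so half a turn is enough.
steps = 1800
best_var, best_beta = -1.0, None
for k in range(steps):
    t = math.pi * k / steps
    beta = (math.cos(t), math.sin(t))
    v = variance(beta, SIGMA)
    if v > best_var:
        best_var, best_beta = v, beta

print(best_var)    # close to the largest root, 3
print(best_beta)   # close to (1, 1)/sqrt(2)
```

A grid scan is used here only because it mirrors the definition (maximize the variance over normalized vectors); it is not how the roots would be computed in practice.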

Now let us find a normalized combination $\beta'X$ that has maximum variance
of all linear combinations uncorrelated with $U_1$. Lack of correlation
means

(7)   $0 = \mathcal{E}\beta'XU_1 = \mathcal{E}\beta'XX'\beta^{(1)} = \beta'\Sigma\beta^{(1)} = \lambda_1\beta'\beta^{(1)},$

since $\Sigma\beta^{(1)} = \lambda_1\beta^{(1)}$. Thus $\beta'X$ is orthogonal to $U_1$ in both the statistical sense
(of lack of correlation) and the geometric sense (of the inner product of the
vectors $\beta$ and $\beta^{(1)}$ being zero). (That is, $\lambda_1\beta'\beta^{(1)} = 0$ only if $\beta'\beta^{(1)} = 0$ when
$\lambda_1 \ne 0$, and $\lambda_1 \ne 0$ if $\Sigma \ne 0$; the case of $\Sigma = 0$ is trivial and is not treated.)
We now want to maximize

(8)   $\phi_2 = \beta'\Sigma\beta - \lambda(\beta'\beta - 1) - 2\nu_1\beta'\Sigma\beta^{(1)},$

where $\lambda$ and $\nu_1$ are Lagrange multipliers. The vector of partial derivatives is

(9)   $\frac{\partial\phi_2}{\partial\beta} = 2\Sigma\beta - 2\lambda\beta - 2\nu_1\Sigma\beta^{(1)},$

and we set this equal to 0. From (9) we obtain by multiplying on the left by
$\beta^{(1)\prime}$

(10)   $0 = 2\beta^{(1)\prime}\Sigma\beta - 2\lambda\beta^{(1)\prime}\beta - 2\nu_1\beta^{(1)\prime}\Sigma\beta^{(1)} = -2\nu_1\lambda_1$

by (7). Therefore, $\nu_1 = 0$ and $\beta$ must satisfy (4), and therefore $\lambda$ must satisfy
(5). Let $\lambda_{(2)}$ be the maximum of $\lambda_1, \ldots, \lambda_p$ such that there is a vector $\beta$
satisfying $(\Sigma - \lambda_{(2)}I)\beta = 0$, $\beta'\beta = 1$, and (7); call this vector $\beta^{(2)}$ and the
corresponding linear combination $U_2 = \beta^{(2)\prime}X$. (It will be shown eventually
that $\lambda_{(2)} = \lambda_2$. We define $\lambda_{(1)} = \lambda_1$.)

This procedure is continued; at the (r + 1)st step, we want to find a vector
$\beta$ such that $\beta'X$ has maximum variance of all normalized linear combinations
which are uncorrelated with $U_1, \ldots, U_r$, that is, such that

(11)   $0 = \mathcal{E}\beta'XU_i = \mathcal{E}\beta'XX'\beta^{(i)} = \beta'\Sigma\beta^{(i)} = \lambda_{(i)}\beta'\beta^{(i)}, \quad i = 1, \ldots, r.$

We want to maximize

(12)   $\phi_{r+1} = \beta'\Sigma\beta - \lambda(\beta'\beta - 1) - 2\sum_{i=1}^{r}\nu_i\beta'\Sigma\beta^{(i)},$

where $\lambda$ and $\nu_1, \ldots, \nu_r$ are Lagrange multipliers. The vector of partial
derivatives is

(13)   $\frac{\partial\phi_{r+1}}{\partial\beta} = 2\Sigma\beta - 2\lambda\beta - 2\sum_{i=1}^{r}\nu_i\Sigma\beta^{(i)},$

and we set this equal to 0. Multiplying (13) on the left by $\beta^{(j)\prime}$, we obtain

(14)   $0 = 2\beta^{(j)\prime}\Sigma\beta - 2\lambda\beta^{(j)\prime}\beta - 2\nu_j\beta^{(j)\prime}\Sigma\beta^{(j)}.$

If $\lambda_{(j)} \ne 0$, this gives $-2\nu_j\lambda_{(j)} = 0$ and $\nu_j = 0$. If $\lambda_{(j)} = 0$, then $\Sigma\beta^{(j)} = \lambda_{(j)}\beta^{(j)}
= 0$ and the jth term in the sum in (13) vanishes. Thus $\beta$ must satisfy (4), and
therefore $\lambda$ must satisfy (5).
Let $\lambda_{(r+1)}$ be the maximum of $\lambda_1, \ldots, \lambda_p$ such that there is a vector $\beta$
satisfying $(\Sigma - \lambda_{(r+1)}I)\beta = 0$, $\beta'\beta = 1$, and (11); call this vector $\beta^{(r+1)}$, and
the corresponding linear combination $U_{r+1} = \beta^{(r+1)\prime}X$. If $\lambda_{(r+1)} = 0$ and
$\lambda_{(j)} = 0$, $j \ne r + 1$, then $\beta^{(j)\prime}\Sigma\beta^{(r+1)} = 0$ does not imply $\beta^{(j)\prime}\beta^{(r+1)} = 0$. However,
$\beta^{(r+1)}$ can be replaced by a linear combination of $\beta^{(r+1)}$ and the $\beta^{(j)}$'s
with $\lambda_{(j)}$'s being 0, so that the new $\beta^{(r+1)}$ is orthogonal to all $\beta^{(j)}$, $j = 1, \ldots, r$.
This procedure is carried on until at the (m + 1)st stage one cannot find a
vector $\beta$ satisfying $\beta'\beta = 1$, (4), and (11). Either m = p or m < p since
$\beta^{(1)}, \ldots, \beta^{(m)}$ must be linearly independent.

We shall now show that the inequality m < p leads to a contradiction. If
m < p there exist p - m vectors, say $e_{m+1}, \ldots, e_p$, such that $\beta^{(i)\prime}e_j = 0$,
$e_i'e_j = \delta_{ij}$. (This follows from Lemma A.4.2 in the Appendix.) Let
$(e_{m+1}, \ldots, e_p) = E$. Now we shall show that there exists a (p - m)-component
vector c and a number $\theta$ such that $Ec = \sum c_ie_i$ is a solution to (4) with $\lambda = \theta$.
Consider a root of $|E'\Sigma E - \theta I| = 0$ and a corresponding vector c satisfying
$E'\Sigma Ec = \theta c$. The vector $\Sigma Ec$ is orthogonal to $\beta^{(1)}, \ldots, \beta^{(m)}$ (since
$\beta^{(i)\prime}\Sigma Ec = \lambda_{(i)}\beta^{(i)\prime}\sum_j c_je_j = 0$) and therefore is a vector in the space
spanned by $e_{m+1}, \ldots, e_p$ and can be written as Eg [where g is a (p - m)-component
vector]. Multiplying $\Sigma Ec = Eg$ on the left by $E'$, we obtain
$E'\Sigma Ec = E'Eg = g$. Thus $g = \theta c$, and we have $\Sigma(Ec) = \theta(Ec)$. Then $(Ec)'X$
is uncorrelated with $\beta^{(j)\prime}X$, $j = 1, \ldots, m$, and thus leads to a new $\beta^{(m+1)}$.

Since this contradicts the assumption that m < p, we must have m = p.

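Let $\boldsymbol{\beta} = (\beta^{(1)} \cdots \beta^{(p)})$ and

(15)   $\Lambda = \begin{pmatrix} \lambda_{(1)} & 0 & \cdots & 0 \\ 0 & \lambda_{(2)} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \lambda_{(p)} \end{pmatrix}.$

The equations $\Sigma\beta^{(r)} = \lambda_{(r)}\beta^{(r)}$ can be written in matrix form as

(16)   $\Sigma\boldsymbol{\beta} = \boldsymbol{\beta}\Lambda,$

and the equations $\beta^{(r)\prime}\beta^{(r)} = 1$ and $\beta^{(r)\prime}\beta^{(s)} = 0$, $r \ne s$, can be written as

(17)   $\boldsymbol{\beta}'\boldsymbol{\beta} = I.$

From (16) and (17) we obtain

(18)   $\boldsymbol{\beta}'\Sigma\boldsymbol{\beta} = \Lambda.$

From the fact that

(19)   $|\Sigma - \lambda I| = |\boldsymbol{\beta}'| \cdot |\Sigma - \lambda I| \cdot |\boldsymbol{\beta}| = |\boldsymbol{\beta}'\Sigma\boldsymbol{\beta} - \lambda\boldsymbol{\beta}'\boldsymbol{\beta}| = |\Lambda - \lambda I| = \prod_{i=1}^{p}(\lambda_{(i)} - \lambda)$

we see that the roots of (19) are the diagonal elements of $\Lambda$; that is,
$\lambda_{(1)} = \lambda_1$, $\lambda_{(2)} = \lambda_2$, ..., $\lambda_{(p)} = \lambda_p$.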

We have proved the following theorem: 

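Theorem 11.2.1. Let the p-component random vector X have $\mathcal{E}X = 0$ and
$\mathcal{E}XX' = \Sigma$. Then there exists an orthogonal linear transformation

(20)   $U = \boldsymbol{\beta}'X$

such that the covariance matrix of U is $\mathcal{E}UU' = \Lambda$ and

(21)   $\Lambda = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \lambda_p \end{pmatrix},$

where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$ are the roots of (5). The rth column of $\boldsymbol{\beta}$, $\beta^{(r)}$,
satisfies $(\Sigma - \lambda_rI)\beta^{(r)} = 0$. The rth component of U, $U_r = \beta^{(r)\prime}X$, has maximum
variance of all normalized linear combinations uncorrelated with $U_1, \ldots, U_{r-1}$.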

The vector U is defined as the vector of principal components of X. It will
be observed that we have proved Theorem A.2.1 of Appendix A for B
positive semidefinite, and indeed, the proof holds for any symmetric B. It
might be noted that once the transformation to $U_1, \ldots, U_p$ has been made,
it is obvious that $U_1$ is the normalized linear combination with maximum
variance, for if $U^* = \sum c_iU_i$, where $\sum c_i^2 = 1$ ($U^*$ also being a normalized
linear combination of the X's), then $\operatorname{Var}(U^*) = \sum c_i^2\lambda_i = \lambda_1 + \sum_{i=2}^{p} c_i^2(\lambda_i - \lambda_1)$
(since $c_1^2 = 1 - \sum_{i=2}^{p} c_i^2$), which is clearly maximum for $c_i = 0$, $i = 2, \ldots, p$.
Similarly, $U_2$ is the normalized linear combination uncorrelated with $U_1$
which has maximum variance ($U^* = \sum c_iU_i$ being uncorrelated with $U_1$ implying
$c_1 = 0$); in turn the maximal properties of $U_3, \ldots, U_p$ are verified.

Some other consequences can be derived. 

Corollary 11.2.1. Suppose $\lambda_{r+1} = \cdots = \lambda_{r+m} = \nu$ (i.e., $\nu$ is a root of
multiplicity m); then $\Sigma - \nu I$ is of rank p - m. Furthermore $\boldsymbol{\beta}^* = (\beta^{(r+1)} \cdots
\beta^{(r+m)})$ is uniquely determined except for multiplication on the right by an
orthogonal matrix.

Proof. From the derivation of the theorem we have $(\Sigma - \nu I)\beta^{(i)} = 0$,
$i = r + 1, \ldots, r + m$; that is, $\beta^{(r+1)}, \ldots, \beta^{(r+m)}$ are m linearly independent
solutions of $(\Sigma - \nu I)\beta = 0$. To show that there cannot be another linearly
independent solution, take $\sum_{i=1}^{p} x_i\beta^{(i)}$, where the $x_i$ are scalars. If it is a
solution, we have $\nu\sum x_i\beta^{(i)} = \Sigma\left(\sum x_i\beta^{(i)}\right) = \sum x_i\Sigma\beta^{(i)} = \sum x_i\lambda_i\beta^{(i)}$. Since $\nu x_i =
\lambda_ix_i$, we must have $x_i = 0$ unless $i = r + 1, \ldots, r + m$. Thus the rank is p - m.

If $\boldsymbol{\beta}^*$ is one set of solutions to $(\Sigma - \nu I)\beta = 0$, then any other set of
solutions are linear combinations of the others, that is, are $\boldsymbol{\beta}^*A$ for A
nonsingular. However, the orthogonality conditions $\boldsymbol{\beta}^{*\prime}\boldsymbol{\beta}^* = I$ applied to the
linear combinations give $I = (\boldsymbol{\beta}^*A)'(\boldsymbol{\beta}^*A) = A'\boldsymbol{\beta}^{*\prime}\boldsymbol{\beta}^*A = A'A$, and thus A
must be orthogonal. ■

Theorem 11.2.2. An orthogonal transformation V = CX of a random vector
X leaves invariant the generalized variance and the sum of the variances of the
components.

Proof. Let $\mathcal{E}X = 0$ and $\mathcal{E}XX' = \Sigma$. Then $\mathcal{E}V = 0$ and $\mathcal{E}VV' = C\Sigma C'$. The
generalized variance of V is

(22)   $|C\Sigma C'| = |C| \cdot |\Sigma| \cdot |C'| = |\Sigma| \cdot |CC'| = |\Sigma|,$

which is the generalized variance of X. The sum of the variances of the
components of V is

(23)   $\sum_i \mathcal{E}V_i^2 = \operatorname{tr}(C\Sigma C') = \operatorname{tr}(\Sigma C'C) = \operatorname{tr}(\Sigma I) = \operatorname{tr}\Sigma = \sum_i \mathcal{E}X_i^2.$ ■

Corollary 11.2.2. The generalized variance of the vector of principal components
is the generalized variance of the original vector, and the sum of the
variances of the principal components is the sum of the variances of the original
variates.
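
The invariance in Theorem 11.2.2 is easy to check numerically. The sketch below uses a hypothetical 2 x 2 covariance matrix and an arbitrary rotation angle (both assumptions for illustration) and compares the determinant and trace of $\Sigma$ with those of $C\Sigma C'$.

```python
import math

# Hypothetical covariance matrix (an assumption for illustration): |Sigma| = 3, tr Sigma = 4.
SIGMA = [[2.0, 1.0],
         [1.0, 2.0]]

def rotate_cov(sigma, theta):
    """Return C Sigma C' for the 2x2 rotation matrix C through the angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    C = [[c, -s], [s, c]]
    CS = [[sum(C[i][k] * sigma[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
    # (C Sigma) C' has (i, j) entry sum_k CS[i][k] * C[j][k].
    return [[sum(CS[i][k] * C[j][k] for k in range(2)) for j in range(2)] for i in range(2)]

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def trace2(m):
    return m[0][0] + m[1][1]

V_COV = rotate_cov(SIGMA, 0.7)          # covariance matrix of V = CX
print(det2(SIGMA), det2(V_COV))         # generalized variances agree
print(trace2(SIGMA), trace2(V_COV))     # sums of variances agree
```

The individual variances on the diagonal of $C\Sigma C'$ do change with the angle; only their sum and the determinant are preserved.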


Another approach to the above theory can be based on the surfaces of
constant density of the normal distribution with mean vector 0 and covariance
matrix $\Sigma$ (nonsingular). The density is

(24)   $(2\pi)^{-p/2}|\Sigma|^{-1/2}\exp\left(-\tfrac{1}{2}x'\Sigma^{-1}x\right),$

and surfaces of constant density are ellipsoids

(25)   $x'\Sigma^{-1}x = C.$

A principal axis of this ellipsoid is defined as the line from -y to y, where y
is a point on the ellipsoid where the squared distance $x'x$ has a stationary
point. Using the method of Lagrange multipliers, we determine the stationary
points by considering

(26)   $\psi = x'x - \lambda x'\Sigma^{-1}x,$

where $\lambda$ is a Lagrange multiplier. We differentiate $\psi$ with respect to the
components of x, and the derivatives set equal to 0 are

(27)   $\frac{\partial\psi}{\partial x} = 2x - 2\lambda\Sigma^{-1}x = 0,$

or

(28)   $x = \lambda\Sigma^{-1}x.$

Multiplication by $\Sigma$ gives

(29)   $\Sigma x = \lambda x.$

This equation is the same as (4) and the same algebra can be developed.
Thus the vectors $\beta^{(1)}, \ldots, \beta^{(p)}$ give the principal axes of the ellipsoid. The
transformation $u = \boldsymbol{\beta}'x$ is a rotation of the coordinate axes so that the new
axes are in the direction of the principal axes of the ellipsoid. In the new
coordinates the ellipsoid is

(30)   $\sum_{i=1}^{p}\frac{u_i^2}{\lambda_i} = C.$

Thus the length of the ith principal axis is $2\sqrt{\lambda_iC}$.
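
A short worked case may help fix ideas. It uses a hypothetical 2 x 2 covariance matrix and C = 1, neither of which comes from the text; the roots 3 and 1 and the vectors $(1, 1)'/\sqrt{2}$ and $(1, -1)'/\sqrt{2}$ follow from solving $|\Sigma - \lambda I| = 0$ directly.

```latex
% Hypothetical example (not from the text): p = 2, C = 1.
\Sigma = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad
\Sigma^{-1} = \frac{1}{3}\begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}, \qquad
x'\Sigma^{-1}x = \tfrac{1}{3}\left(2x_1^2 - 2x_1x_2 + 2x_2^2\right) = 1 .

% Characteristic roots and vectors:
\lambda_1 = 3,\ \beta^{(1)} = \tfrac{1}{\sqrt{2}}(1, 1)', \qquad
\lambda_2 = 1,\ \beta^{(2)} = \tfrac{1}{\sqrt{2}}(1, -1)' .

% Rotation to principal axes: u_1 = (x_1 + x_2)/\sqrt{2},\ u_2 = (x_1 - x_2)/\sqrt{2}, so that
% x_1^2 + x_2^2 = u_1^2 + u_2^2 and x_1 x_2 = (u_1^2 - u_2^2)/2, giving
\tfrac{1}{3}\left[2(u_1^2 + u_2^2) - (u_1^2 - u_2^2)\right]
  = \frac{u_1^2}{3} + \frac{u_2^2}{1} = 1 .

% Lengths of the principal axes: 2\sqrt{\lambda_i C}
2\sqrt{3} \approx 3.46 \quad\text{and}\quad 2 .
```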

A third approach to the same results is in terms of planes of closest fit
[Pearson (1901)]. Consider a plane through the origin, $\alpha'x = 0$, where $\alpha'\alpha =
1$. The distance of a point x from this plane is $\alpha'x$. Let us find the
coefficients of a plane such that the expected distance squared of a random
point X from the plane is a minimum, where $\mathcal{E}X = 0$ and $\mathcal{E}XX' = \Sigma$. Thus
we wish to minimize $\mathcal{E}(\alpha'X)^2 = \mathcal{E}\alpha'XX'\alpha = \alpha'\Sigma\alpha$, subject to the restriction
$\alpha'\alpha = 1$. Comparison with the first approach immediately shows that the
solution is $\alpha = \beta^{(p)}$.

Analysis into principal components is most suitable when all the components
of X are measured in the same units. If they are not measured in the
same units, the rationale of maximizing $\beta'\Sigma\beta$ relative to $\beta'\beta$ is questionable;
in fact, the analysis will depend on the various units of measurement.
Suppose $\Delta$ is a diagonal matrix, and let $Y = \Delta X$. For example, one component
of X may be measured in inches, and the corresponding component of Y
may be measured in feet; another component of X may be in pounds and the
corresponding one of Y in ounces. The covariance matrix of Y is $\mathcal{E}YY' =
\mathcal{E}\Delta XX'\Delta = \Delta\Sigma\Delta = \Psi$, say. Then analysis of Y into principal components
involves maximizing $\mathcal{E}(\gamma'Y)^2 = \gamma'\Psi\gamma$ relative to $\gamma'\gamma$ and leads to the
equation $0 = (\Psi - \nu I)\gamma = (\Delta\Sigma\Delta - \nu I)\gamma$, where $\nu$ must satisfy $|\Psi - \nu I| =
0$. Multiplication on the left by $\Delta^{-1}$ gives

(31)   $0 = (\Sigma - \nu\Delta^{-2})(\Delta\gamma).$

Let $\Delta\gamma = \alpha$; that is, $\gamma'Y = \gamma'\Delta X = \alpha'X$. Then (31) results from maximizing
$\mathcal{E}(\alpha'X)^2 = \alpha'\Sigma\alpha$ relative to $\alpha'\Delta^{-2}\alpha$. This last quadratic form is a weighted
sum of squares, the weights being the diagonal elements of $\Delta^{-2}$.

It might be noted that if $\Delta^{-2}$ is taken to be the matrix

(32)   $\Delta^{-2} = \begin{pmatrix} \sigma_{11} & 0 & \cdots & 0 \\ 0 & \sigma_{22} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \sigma_{pp} \end{pmatrix},$

then $\Psi$ is the matrix of correlations.
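
The dependence on units can be seen in a two-variable sketch. The covariance matrix below is hypothetical (an assumption for illustration, with the first variable thought of as inches and the second as pounds), and `first_component` and `rescale` are illustrative helper names. The first principal component is computed for the original units, after converting the first variable to feet, and after standardizing to the correlation matrix.

```python
import math

def first_component(m):
    """Largest root and unit characteristic vector of a symmetric 2x2 matrix with m[0][1] != 0."""
    a, b, d = m[0][0], m[0][1], m[1][1]
    lam = 0.5 * (a + d) + math.sqrt((0.5 * (a - d)) ** 2 + b * b)
    v = (b, lam - a)                      # solves (M - lam I) v = 0
    length = math.hypot(v[0], v[1])
    return lam, (v[0] / length, v[1] / length)

def rescale(sigma, delta):
    """Return Delta Sigma Delta for a diagonal Delta given by its diagonal entries."""
    return [[delta[i] * sigma[i][j] * delta[j] for j in range(2)] for i in range(2)]

# Hypothetical covariance matrix: X1 in inches (variance 4), X2 in pounds (variance 1).
SIGMA = [[4.0, 1.0],
         [1.0, 1.0]]

lam_in, beta_in = first_component(SIGMA)                                # original units
lam_ft, gamma_ft = first_component(rescale(SIGMA, (1.0 / 12.0, 1.0)))   # X1 in feet
lam_r, gamma_r = first_component(rescale(SIGMA, (0.5, 1.0)))            # correlation matrix

print(beta_in)    # weight mostly on X1
print(gamma_ft)   # weight mostly on X2 after the change of units
print(gamma_r)    # equal weights for the correlation matrix
```

With X1 in inches the first component is dominated by X1; in feet it is dominated by X2; for the correlation matrix the two standardized variables enter with equal weight.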

11.3. MAXIMUM LIKELIHOOD ESTIMATORS OF THE PRINCIPAL
COMPONENTS AND THEIR VARIANCES

A primary problem of statistical inference in principal component analysis is
to estimate the vectors $\beta^{(1)}, \ldots, \beta^{(p)}$ and the scalars $\lambda_1, \ldots, \lambda_p$. We apply the
algebra of the preceding section to an estimate of the covariance matrix.

Theorem 11.3.1. Let $x_1, \ldots, x_N$ be N (> p) observations from $N(\mu, \Sigma)$,
where $\Sigma$ is a matrix with p different characteristic roots. Then a set of maximum
likelihood estimators of $\lambda_1, \ldots, \lambda_p$ and $\beta^{(1)}, \ldots, \beta^{(p)}$ defined in Theorem 11.2.1
consists of the roots $k_1 > \cdots > k_p$ of

(1)   $|\hat{\Sigma} - kI| = 0$

and a set of corresponding vectors $b^{(1)}, \ldots, b^{(p)}$ satisfying

(2)   $(\hat{\Sigma} - k_iI)b^{(i)} = 0,$

(3)   $b^{(i)\prime}b^{(i)} = 1,$

where $\hat{\Sigma}$ is the maximum likelihood estimate of $\Sigma$.

Proof. When the roots of $|\Sigma - \lambda I| = 0$ are different, each vector $\beta^{(i)}$ is
uniquely defined except that $\beta^{(i)}$ can be replaced by $-\beta^{(i)}$. If we require that
the first nonzero component of $\beta^{(i)}$ be positive, then $\beta^{(i)}$ is uniquely defined,
and $\mu, \Lambda, \boldsymbol{\beta}$ is a single-valued function of $\mu, \Sigma$. By Corollary 3.2.1, the set of
maximum likelihood estimates of $\mu, \Lambda, \boldsymbol{\beta}$ is the same function of $\hat{\mu}, \hat{\Sigma}$. This
function is defined by (1), (2), and (3) with the corresponding restriction that
the first nonzero component of $b^{(i)}$ must be positive. [It can be shown that if
$|\Sigma| \ne 0$, the probability is 1 that the roots of (1) are different, because the
conditions on $\hat{\Sigma}$ for the roots to have multiplicities higher than 1 determine a
region in the space of $\hat{\Sigma}$ of dimensionality less than $\tfrac{1}{2}p(p + 1)$; see Okamoto
(1973).] From (18) of Section 11.2 we see that

(4)   $\Sigma = \boldsymbol{\beta}\Lambda\boldsymbol{\beta}' = \sum_{i=1}^{p}\lambda_i\beta^{(i)}\beta^{(i)\prime},$

and by the same algebra

(5)   $\hat{\Sigma} = \sum_{i=1}^{p}k_ib^{(i)}b^{(i)\prime}.$

Replacing $b^{(i)}$ by $-b^{(i)}$ clearly does not change $\sum k_ib^{(i)}b^{(i)\prime}$. Since the
likelihood function depends only on $\hat{\Sigma}$ (see Section 3.2), the maximum of the
likelihood function is attained by taking any set of solutions of (2) and (3).
■

It is possible to assume explicitly arbitrary multiplicities of roots of $\Sigma$. If
these multiplicities are not all unity, the maximum likelihood estimates are
not defined as in Theorem 11.3.1. [See Anderson (1963a).] As an example
suppose that we assume that the equation $|\Sigma - \lambda I| = 0$ has one root of
multiplicity p. Let this root be $\lambda_1$. Then by Corollary 11.2.1, $\Sigma - \lambda_1I$ is of
rank 0; that is, $\Sigma - \lambda_1I = 0$ or $\Sigma = \lambda_1I$. If X is distributed according to
$N(\mu, \Sigma) = N(\mu, \lambda_1I)$, the components of X are independently distributed
with variance $\lambda_1$. Thus the maximum likelihood estimator of $\lambda_1$ is

(6)   $\hat{\lambda}_1 = \frac{1}{pN}\sum_{i=1}^{p}\sum_{\alpha=1}^{N}(x_{i\alpha} - \bar{x}_i)^2,$

and $\hat{\Sigma} = \hat{\lambda}_1I$, and $\boldsymbol{\beta}$ can be any orthogonal matrix. It might be pointed out
that in Section 10.7 we considered a test of the hypothesis that $\Sigma = \lambda_1I$ (with
$\lambda_1$ unspecified), that is, the hypothesis is that $\Sigma$ has one characteristic root of
multiplicity p.

In most applications of principal component analysis it can be assumed
that the roots of $\Sigma$ are different. It might also be pointed out that in some
uses of this method the algebra is applied to the matrix of correlation
coefficients rather than to the covariance matrix. In general this leads to
different roots and vectors.


11.4. COMPUTATION OF THE MAXIMUM LIKELIHOOD
ESTIMATES OF THE PRINCIPAL COMPONENTS

There are several ways of computing the characteristic roots and characteristic
vectors (principal components) of a matrix $\hat{\Sigma}$ or $\Sigma$. We shall indicate
some of them.

One method for small p involves expanding the determinantal equation

(1)   $0 = |\Sigma - \lambda I|$

and solving the resulting pth-degree equation in $\lambda$ (e.g., by Newton's method
or the secant method) for the roots $\lambda_1 > \lambda_2 > \cdots > \lambda_p$. Then $\Sigma - \lambda_iI$ is of
rank p - 1, and a solution of $(\Sigma - \lambda_iI)\beta^{(i)} = 0$ can be obtained by taking $\beta_j^{(i)}$
as the cofactor of the element in the first (or any other fixed) column and jth
row of $\Sigma - \lambda_iI$.

The second method iterates using the equation for a characteristic root
and the corresponding characteristic vector

(2)   $\Sigma x = \lambda x,$

where we have written the equation for the population. Let $x^{(0)}$ be any vector
not orthogonal to the first characteristic vector, and define

(3)   $x^{(i)} = \Sigma y^{(i-1)}, \qquad y^{(i)} = \frac{1}{\sqrt{x^{(i)\prime}x^{(i)}}}\,x^{(i)}, \qquad i = 0, 1, 2, \ldots.$

It can be shown (Problem 11.12) that

(4)   $\lim_{i\to\infty} y^{(i)} = \pm\beta^{(1)}, \qquad \lim_{i\to\infty} x^{(i)\prime}x^{(i)} = \lambda_1^2.$

The rate of convergence depends on the ratio $\lambda_2/\lambda_1$; the closer this ratio is
to 1, the slower the convergence.
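
The iteration (3) can be sketched in a few lines. The example below applies it to the sample covariance matrix S of the Iris versicolor data from Section 11.5 in place of $\Sigma$; the function names and the fixed iteration count are assumptions made for illustration, not part of the text.

```python
# Sample covariance matrix S (Iris versicolor), copied from Section 11.5.
S = [[0.266433, 0.085184, 0.182899, 0.055780],
     [0.085184, 0.098469, 0.082653, 0.041204],
     [0.182899, 0.082653, 0.220816, 0.073102],
     [0.055780, 0.041204, 0.073102, 0.039106]]

def matvec(m, v):
    """Matrix-vector product m v."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]

def norm(v):
    """Euclidean length sqrt(v'v)."""
    return sum(c * c for c in v) ** 0.5

def power_iteration(m, x0, iterations=100):
    """Iterate x = M y, y = x / sqrt(x'x) as in (3).

    Returns (sqrt(x'x), y), which approximate the largest root and its
    normalized characteristic vector when x0 is not orthogonal to that vector.
    """
    y = [c / norm(x0) for c in x0]
    length = 0.0
    for _ in range(iterations):
        x = matvec(m, y)
        length = norm(x)
        y = [c / length for c in x]
    return length, y

l1, b1 = power_iteration(S, [1.0, 0.0, 1.0, 0.0])
print(round(l1, 6))                 # Section 11.5 reports l_1 = 0.487875
print([round(c, 4) for c in b1])    # Section 11.5 reports (0.6867, 0.3053, 0.6237, 0.2150)
```

A fixed number of iterations keeps the sketch short; a practical version would stop when successive vectors agree to a chosen tolerance.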

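To find the second root and vector define

(5)   $\Sigma_2 = \Sigma - \lambda_1\beta^{(1)}\beta^{(1)\prime}.$

Then

(6)   $\Sigma_2\beta^{(i)} = \Sigma\beta^{(i)} - \lambda_1\beta^{(1)}\beta^{(1)\prime}\beta^{(i)} = \Sigma\beta^{(i)} = \lambda_i\beta^{(i)}$

if $i \ne 1$, and

(7)   $\Sigma_2\beta^{(1)} = 0.$

Thus $\lambda_2$ is the largest root of $\Sigma_2$ and $\beta^{(2)}$ is the corresponding vector. The
iteration process is now applied to $\Sigma_2$ to find $\lambda_2$ and $\beta^{(2)}$. Defining $\Sigma_3 = \Sigma_2
- \lambda_2\beta^{(2)}\beta^{(2)\prime}$, we can find $\lambda_3$ and $\beta^{(3)}$, and so forth.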
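
The deflation step (5) combines naturally with the iteration (3). The sketch below again uses the Iris versicolor matrix S of Section 11.5 in place of $\Sigma$; the helper names and iteration counts are illustrative assumptions. More iterations are used for the second root because the ratio of the third root to the second is much closer to 1.

```python
# Sample covariance matrix S (Iris versicolor), copied from Section 11.5.
S = [[0.266433, 0.085184, 0.182899, 0.055780],
     [0.085184, 0.098469, 0.082653, 0.041204],
     [0.182899, 0.082653, 0.220816, 0.073102],
     [0.055780, 0.041204, 0.073102, 0.039106]]

def matvec(m, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]

def norm(v):
    return sum(c * c for c in v) ** 0.5

def power_iteration(m, x0, iterations):
    """Iterate x = M y, y = x / sqrt(x'x); return (sqrt(x'x), y)."""
    y = [c / norm(x0) for c in x0]
    length = 0.0
    for _ in range(iterations):
        x = matvec(m, y)
        length = norm(x)
        y = [c / length for c in x]
    return length, y

def deflate(m, root, vec):
    """Return M - root * vec vec', the deflated matrix of (5)."""
    p = len(vec)
    return [[m[i][j] - root * vec[i] * vec[j] for j in range(p)] for i in range(p)]

l1, b1 = power_iteration(S, [1.0, 0.0, 1.0, 0.0], iterations=200)
S2 = deflate(S, l1, b1)
l2, b2 = power_iteration(S2, [0.0, 1.0, 0.0, 0.0], iterations=500)

print(round(l2, 5))                          # Section 11.5 reports l_2 = 0.0723828
print(sum(b1[i] * b2[i] for i in range(4)))  # close to 0: the two vectors are orthogonal
```

Because the deflated matrix depends on the computed first root and vector, errors in them carry over to the later roots, which is why the first pair has to be computed most accurately.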

There are several ways in which the labor of the iteration procedure may
be reduced. One is to raise $\Sigma$ to a power before proceeding with the
iteration. Thus one can use $\Sigma^2$, defining

(8)   $x^{(i)} = \Sigma^2y^{(i-1)}, \qquad y^{(i)} = \frac{1}{\sqrt{x^{(i)\prime}x^{(i)}}}\,x^{(i)}, \qquad i = 0, 1, 2, \ldots.$

This procedure will give twice as rapid convergence as the use of (3). Using
$\Sigma^4 = \Sigma^2\Sigma^2$ will lead to convergence four times as rapid, and so on. It should
be noted that since $\Sigma^2$ is symmetric, there are only p(p + 1)/2 elements to
be found.

Efficient computation, however, uses other methods. One method is the
QR or QL algorithm. Let $\Sigma_0 = \Sigma$. Define recursively the orthogonal $Q_i$ and
lower triangular $L_i$ by $\Sigma_i = Q_iL_i$ and $\Sigma_{i+1} = L_iQ_i$ $(= Q_i'\Sigma_iQ_i)$, $i = 0, 1, 2, \ldots$.
(The Gram-Schmidt orthogonalization is a way of finding $Q_i$ and $L_i$; the QR
method replaces a lower triangular matrix L by an upper triangular matrix
R.) If the characteristic roots of $\Sigma$ are distinct, $\lim_{i\to\infty}\Sigma_{i+1} = \Lambda^*$, where $\Lambda^*$
is the diagonal matrix with the roots usually ordered in ascending order. The
characteristic vectors are the columns of $\lim_{i\to\infty}Q_0Q_1\cdots Q_i$ (which is computed
recursively).
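
A minimal sketch of the unshifted iteration follows. It uses the QR variant (upper triangular factor, found by Gram-Schmidt orthogonalization of the columns) rather than QL, so the roots tend to appear on the diagonal in descending rather than ascending order. The matrix is the Iris versicolor S of Section 11.5; the function names and the fixed iteration count are assumptions for illustration.

```python
# Sample covariance matrix S (Iris versicolor), copied from Section 11.5.
S = [[0.266433, 0.085184, 0.182899, 0.055780],
     [0.085184, 0.098469, 0.082653, 0.041204],
     [0.182899, 0.082653, 0.220816, 0.073102],
     [0.055780, 0.041204, 0.073102, 0.039106]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def qr_decompose(a):
    """Return (Q, R) with A = Q R, Q orthogonal and R upper triangular (modified Gram-Schmidt)."""
    n = len(a)
    q_cols = []
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        w = [a[i][j] for i in range(n)]                  # jth column of A
        for i, q in enumerate(q_cols):
            r[i][j] = sum(q[k] * w[k] for k in range(n))
            w = [w[k] - r[i][j] * q[k] for k in range(n)]
        r[j][j] = sum(c * c for c in w) ** 0.5
        q_cols.append([c / r[j][j] for c in w])
    q = [[q_cols[j][i] for j in range(n)] for i in range(n)]
    return q, r

T = [row[:] for row in S]
V = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]   # accumulates Q_0 Q_1 ... Q_i
for _ in range(300):
    Q, R = qr_decompose(T)
    T = matmul(R, Q)          # similar to the previous T: R Q = Q' (Q R) Q
    V = matmul(V, Q)

roots = [T[i][i] for i in range(4)]
print([round(x, 4) for x in roots])   # Section 11.5 reports 0.4879, 0.0724, 0.0548, 0.0098
```

The columns of `V` approximate the characteristic vectors, up to sign. Production routines apply shifts and work on a tridiagonal form, as described next; this sketch omits both.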

A more efficient algorithm (for the symmetric $\Sigma$) uses a sequence of
Householder transformations to carry $\Sigma$ to tridiagonal form. A Householder
matrix is $H = I - 2\alpha\alpha'$ where $\alpha'\alpha = 1$. Such a matrix is orthogonal and
symmetric. A Householder transformation of the symmetric matrix $\Sigma$ is
$H\Sigma H$. It is symmetric and has the same characteristic roots as $\Sigma$; its
characteristic vectors are H times those of $\Sigma$.

A tridiagonal matrix is one with all entries 0 except on the main diagonal,
the first superdiagonal, and the first subdiagonal. A sequence of p - 2
Householder transformations carries the symmetric $\Sigma$ to tridiagonal form.
(The first one inserts 0's into the last p - 2 entries of the first column and
row of $H\Sigma H$, etc. See Problem 11.13.)
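
The first of these Householder transformations can be sketched as follows, again using the Iris versicolor matrix S of Section 11.5 in place of $\Sigma$. The particular choice of $\alpha$ below (built from the part of the first column lying under the diagonal) is one standard construction and is an assumption here, as are the function names; for p = 4 a second transformation acting on the second column would complete the reduction.

```python
# Sample covariance matrix S (Iris versicolor), copied from Section 11.5.
S = [[0.266433, 0.085184, 0.182899, 0.055780],
     [0.085184, 0.098469, 0.082653, 0.041204],
     [0.182899, 0.082653, 0.220816, 0.073102],
     [0.055780, 0.041204, 0.073102, 0.039106]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def householder_for_first_column(m):
    """Return H = I - 2 alpha alpha' (alpha'alpha = 1) such that H M H has zeros
    in the last p - 2 entries of its first column and row."""
    n = len(m)
    x = [m[i][0] for i in range(1, n)]             # first column below the diagonal
    length = sum(c * c for c in x) ** 0.5
    u = x[:]
    u[0] += length if x[0] >= 0 else -length       # sign chosen to avoid cancellation
    u_norm = sum(c * c for c in u) ** 0.5
    alpha = [0.0] + [c / u_norm for c in u]        # first component 0 leaves row/column 1 in place
    return [[(1.0 if i == j else 0.0) - 2.0 * alpha[i] * alpha[j] for j in range(n)]
            for i in range(n)]

H = householder_for_first_column(S)
T = matmul(matmul(H, S), H)

print([round(T[i][0], 6) for i in range(4)])   # last two entries of the first column are 0
print(round(sum(T[i][i] for i in range(4)), 6))  # trace unchanged: 0.624824
```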
The QL method is applied to the tridiagonal form. At the ith step let the
tridiagonal matrix be $T_0^{(i)}$; let $P_j^{(i)}$ be a block-diagonal matrix (Givens matrix)

$P_j^{(i)} = \begin{pmatrix} I & 0 & 0 & 0 \\ 0 & \cos\theta_j & -\sin\theta_j & 0 \\ 0 & \sin\theta_j & \cos\theta_j & 0 \\ 0 & 0 & 0 & I \end{pmatrix},$

where $\cos\theta_j$ is the jth and j + 1st diagonal element; and let $T_j^{(i)} = P_{p-j}^{(i)}T_{j-1}^{(i)}$,
$j = 1, \ldots, p - 1$. Here $\theta_j$ is chosen so that the element in position j, j + 1 in
$T_{p-j}^{(i)}$ is 0. Then $P^{(i)} = P_1^{(i)}P_2^{(i)}\cdots P_{p-1}^{(i)}$ is orthogonal and $P^{(i)}T_0^{(i)} = R^{(i)}$ is lower
triangular. Then $T_0^{(i+1)} = R^{(i)}P^{(i)\prime}$ $(= P^{(i)}T_0^{(i)}P^{(i)\prime})$ is symmetric and tridiagonal.
It converges to $\Lambda^*$ (if the roots are all different). For more details see
Chapters II/2 and II/3 of Wilkinson and Reinsch (1971), Chapter 5 of
Wilkinson (1965), and Chapters 5, 7, and 8 of Golub and Van Loan (1989).
A sequence of one-sided Householder transformations ($H\Sigma$) can carry $\Sigma$ to
R (upper triangular), thus effecting the QR decomposition.


11.5. AN EXAMPLE

In Table 3.4 we presented three samples of observations on varieties of iris
[Fisher (1936)]; as an example of principal component analysis we use one of
those samples, namely Iris versicolor. There are 50 observations (N = 50,
n = N - 1 = 49). Each observation consists of four measurements on a plant:
$x_1$ is sepal length, $x_2$ is sepal width, $x_3$ is petal length, and $x_4$ is petal width.
The observed sums of squares and cross products of deviations from means
are

(1)   $A = \sum_{\alpha=1}^{50}(x_\alpha - \bar{x})(x_\alpha - \bar{x})' = \begin{pmatrix} 13.0552 & 4.1740 & 8.9620 & 2.7332 \\ 4.1740 & 4.8250 & 4.0500 & 2.0190 \\ 8.9620 & 4.0500 & 10.8200 & 3.5820 \\ 2.7332 & 2.0190 & 3.5820 & 1.9162 \end{pmatrix},$

and an estimate of $\Sigma$ is

(2)   $S = \frac{1}{49}A = \begin{pmatrix} 0.266433 & 0.085184 & 0.182899 & 0.055780 \\ 0.085184 & 0.098469 & 0.082653 & 0.041204 \\ 0.182899 & 0.082653 & 0.220816 & 0.073102 \\ 0.055780 & 0.041204 & 0.073102 & 0.039106 \end{pmatrix}.$

We use the iterative procedure to find the first principal component, by
computing in turn $z^{(j)} = Sz^{(j-1)}$. As an initial approximation, we use $z^{(0)\prime} =
(1, 0, 1, 0)$. It is not necessary to normalize the vector at each iteration; but to
compare successive vectors, we compute $z_i^{(j)}/z_i^{(j-1)} = r_i^{(j)}$, each of which is an
approximation to $l_1$, the largest root of S. After seven iterations, $r_i^{(7)}$ agree to
within two units in the fifth decimal place (fifth significant figure). This
vector is normalized, and S is applied to the normalized vector. The ratios,
$r_i^{(8)}$, agree to within two units in the sixth place; the value of $l_1$ is (nearly
accurate to the sixth place) $l_1 = 0.487875$. The normalized eighth iterated
vector is our estimate of $\beta^{(1)}$, namely,

(3)   $b^{(1)} = \begin{pmatrix} 0.6867244 \\ 0.3053463 \\ 0.6236628 \\ 0.2149837 \end{pmatrix}.$

This vector agrees with the normalized seventh iterate to about one unit in
the sixth place. It should be pointed out that $l_1$ and $b^{(1)}$ have to be calculated
more accurately than $l_2$ and $b^{(2)}$, and so forth. The trace of S is 0.624824,
which is the sum of the roots. Thus $l_1$ is more than three times the sum of
the other roots.

We next compute

(4)   $S_2 = S - l_1b^{(1)}b^{(1)\prime} = \begin{pmatrix} 0.0363559 & -0.0171179 & -0.0260502 & -0.0162472 \\ -0.0171179 & 0.0529813 & -0.0102546 & 0.0091777 \\ -0.0260502 & -0.0102546 & 0.0310544 & 0.0076890 \\ -0.0162472 & 0.0091777 & 0.0076890 & 0.0165574 \end{pmatrix},$

and iterate $z^{(j)} = S_2z^{(j-1)}$, using $z^{(0)\prime} = (0, 1, 0, 0)$. (In the actual computation
$S_2$ was multiplied by 10 and the first row and column were multiplied by -1.)
In this case the iteration does not proceed as rapidly; as will be seen, the
ratio of $l_2$ to $l_3$ is approximately 1.32. On the last iteration, the ratios agree
to within four units in the fifth significant figure. We obtain $l_2 = 0.0723828$
and

(5)   $b^{(2)} = \begin{pmatrix} -0.669033 \\ 0.567484 \\ 0.343309 \\ 0.335307 \end{pmatrix}.$

The third principal component is found from $S_3 = S_2 - l_2b^{(2)}b^{(2)\prime}$, and the
fourth from $S_4 = S_3 - l_3b^{(3)}b^{(3)\prime}$.
The results may be summarized as follows:

(6)   $(l_1, l_2, l_3, l_4) = (0.4879, 0.0724, 0.0548, 0.0098),$

(7)   $B = \begin{pmatrix} 0.6867 & -0.6690 & -0.2651 & 0.1023 \\ 0.3053 & 0.5675 & -0.7296 & -0.2289 \\ 0.6237 & 0.3433 & 0.6272 & -0.3160 \\ 0.2150 & 0.3353 & 0.0637 & 0.9150 \end{pmatrix}.$

The sum of the four roots is $\sum l_i = 0.6249$, compared with the trace of the
sample covariance matrix, tr S = 0.624824. The first accounts for 78% of the
total variance in the four measurements; the last accounts for a little more
than 1%. In fact, the variance of $0.7x_1 + 0.3x_2 + 0.6x_3 + 0.2x_4$ (an approximation
to the first principal component) is 0.478, which is almost 77% of the
total variance. If one is interested in studying the variations in conditions that
lead to variations of $(x_1, x_2, x_3, x_4)$, one can look for variations in conditions
that lead to variations of $0.7x_1 + 0.3x_2 + 0.6x_3 + 0.2x_4$. It is not very important
if the other variations in $(x_1, x_2, x_3, x_4)$ are neglected in exploratory
investigations.
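
The proportions quoted above can be rechecked from the matrix S of (2) and the roots reported in the text. The sketch below only repeats that arithmetic; the variable names are illustrative.

```python
# Sample covariance matrix S from (2) and the roots as reported in the text.
S = [[0.266433, 0.085184, 0.182899, 0.055780],
     [0.085184, 0.098469, 0.082653, 0.041204],
     [0.182899, 0.082653, 0.220816, 0.073102],
     [0.055780, 0.041204, 0.073102, 0.039106]]
L1 = 0.487875            # largest root, as reported
L4 = 0.0098              # smallest root, as reported in (6)

total = sum(S[i][i] for i in range(4))            # tr S, the total variance
share_first = L1 / total                          # proportion due to the first component
share_last = L4 / total                           # proportion due to the last component

a = [0.7, 0.3, 0.6, 0.2]                          # rounded coefficients of the first component
var_a = sum(a[i] * S[i][j] * a[j] for i in range(4) for j in range(4))   # a'Sa

print(round(total, 6))           # tr S
print(round(share_first, 3))     # about 0.78
print(round(share_last, 3))      # a little more than 0.01
print(round(var_a, 3))           # the text reports 0.478
print(round(var_a / total, 3))   # almost 0.77
```

Note that the rounded coefficient vector has squared length 0.98 rather than 1, so the linear combination is only approximately normalized.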

11.6. STATISTICAL INFERENCE
11.6.1. Asymptotic Distributions

In Section 13.3 we shall derive the exact distribution of the sample characteristic
roots and vectors when the population covariance matrix is I or
proportional to I, that is, in the case of all population roots equal. The exact
distribution of roots and vectors when the population roots are not all equal
involves a multiply infinite series of zonal polynomials; that development is
beyond the scope of this book. [See Muirhead (1982).] We derive the
asymptotic distribution of the roots and vectors when the population roots
are all different (Theorem 13.5.1) and also when one root is multiple
(Theorem 13.5.2). Since it can usually be assumed that the population roots
are different unless there is information to the contrary, we summarize here
Theorem 13.5.1.

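As earlier, let the characteristic roots of $\Sigma$ be $\lambda_1 > \cdots > \lambda_p$ and the
corresponding characteristic vectors be $\beta^{(1)}, \ldots, \beta^{(p)}$, normalized so
$\beta^{(i)\prime}\beta^{(i)} = 1$ and satisfying $\beta_{1i} > 0$, $i = 1, \ldots, p$. Let the roots and vectors of S be
$l_1 > \cdots > l_p$ and $b^{(1)}, \ldots, b^{(p)}$, normalized so $b^{(i)\prime}b^{(i)} = 1$ and satisfying $b_{1i} > 0$,
$i = 1, \ldots, p$. Let $d_i = \sqrt{n}(l_i - \lambda_i)$ and $g^{(i)} = \sqrt{n}(b^{(i)} - \beta^{(i)})$, $i = 1, \ldots, p$. Then
in the limiting normal distribution the sets $d_1, \ldots, d_p$ and $g^{(1)}, \ldots, g^{(p)}$ are
independent and $d_1, \ldots, d_p$ are mutually independent. The element $d_i$ has
the limiting distribution $N(0, 2\lambda_i^2)$. The covariances of $g^{(1)}, \ldots, g^{(p)}$ in the
limiting distribution are

(1)   $\mathcal{A}\mathrm{Cov}(g^{(i)}) = \sum_{k=1,\,k \ne i}^{p}\frac{\lambda_i\lambda_k}{(\lambda_i - \lambda_k)^2}\,\beta^{(k)}\beta^{(k)\prime},$

(2)   $\mathcal{A}\mathrm{Cov}(g^{(i)}, g^{(j)}) = -\frac{\lambda_i\lambda_j}{(\lambda_i - \lambda_j)^2}\,\beta^{(j)}\beta^{(i)\prime}, \qquad i \ne j.$

See Theorem 13.5.1.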

In making inferences about a single ordered root, one treats $l_i$ as approximately
normal with mean $\lambda_i$ and variance $2\lambda_i^2/n$. Since $l_i$ is a consistent
estimate of $\lambda_i$, the limiting distribution of

(3)   $\sqrt{n}\,\frac{l_i - \lambda_i}{\sqrt{2}\,l_i}$

is N(0, 1). A two-tailed test of the hypothesis $\lambda_i = \lambda_i^0$ has the (asymptotic)
acceptance region

(4)   $-z(\varepsilon) \le \sqrt{\frac{n}{2}}\,\frac{l_i - \lambda_i^0}{\lambda_i^0} \le z(\varepsilon),$

where the value of the N(0, 1) distribution beyond $z(\varepsilon)$ is $\tfrac{1}{2}\varepsilon$. The interval
(4) can be inverted to give a confidence interval for $\lambda_i$ with confidence $1 - \varepsilon$:

(5)   $\frac{l_i}{1 + \sqrt{2/n}\,z(\varepsilon)} < \lambda_i < \frac{l_i}{1 - \sqrt{2/n}\,z(\varepsilon)}.$

Note that the confidence coefficient should be taken large enough so
$\sqrt{2/n}\,z(\varepsilon) < 1$. Alternatively, one can use the fact that the limiting distribution
of $\sqrt{n}(\log l_i - \log\lambda_i)$ is N(0, 2) by Theorem 4.2.3.
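
As a sketch of how (5) and the logarithmic alternative might be evaluated, the example below uses the largest root of the iris example in Section 11.5 ($l_1 = 0.487875$, n = 49) with $\varepsilon = 0.05$. Applying the interval to that example, the function names, and the use of Python's `statistics.NormalDist` for $z(\varepsilon)$ are assumptions made here for illustration.

```python
import math
from statistics import NormalDist

def root_interval(l_i, n, eps):
    """Large-sample interval (5) for a characteristic root; needs sqrt(2/n) z(eps) < 1."""
    z = NormalDist().inv_cdf(1.0 - eps / 2.0)     # N(0, 1) probability beyond z is eps/2
    half = math.sqrt(2.0 / n) * z
    if half >= 1.0:
        raise ValueError("sqrt(2/n) z(eps) must be less than 1")
    return l_i / (1.0 + half), l_i / (1.0 - half)

def root_interval_log(l_i, n, eps):
    """Alternative interval based on sqrt(n)(log l_i - log lambda_i) being N(0, 2) in the limit."""
    z = NormalDist().inv_cdf(1.0 - eps / 2.0)
    half = math.sqrt(2.0 / n) * z
    return l_i * math.exp(-half), l_i * math.exp(half)

lower, upper = root_interval(0.487875, 49, 0.05)
log_lower, log_upper = root_interval_log(0.487875, 49, 0.05)
print(round(lower, 4), round(upper, 4))           # roughly 0.35 to 0.81
print(round(log_lower, 4), round(log_upper, 4))   # roughly 0.33 to 0.72
```

The two intervals differ noticeably at n = 49, which reflects how rough the normal approximation still is for a sample of this size.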

Inference about components of a vector $\beta^{(i)}$ can be based on treating $b^{(i)}$
as being approximately normal with mean $\beta^{(i)}$ and (singular) covariance
matrix 1/n times (1).


11.6.2. Confidence Region for a Characteristic Vector 

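We use the asymptotic distribution of the sample characteristic vectors to
obtain a large-sample confidence region for the $i$th characteristic vector of $\Sigma$
[Anderson (1963a)]. The covariance matrix (1) can be written

(6)    $\beta \Lambda_i \beta' = \beta_i^* \Lambda_i^* \beta_i^{*\prime},$

where $\Lambda_i$ is the $p \times p$ diagonal matrix with 0 as the $i$th diagonal element and
$\lambda_i \lambda_j/(\lambda_i - \lambda_j)^2$ as the $j$th diagonal element, $j \ne i$; $\Lambda_i^*$ is the $(p - 1) \times (p - 1)$
diagonal matrix obtained from $\Lambda_i$ by deleting the $i$th row and column; and
$\beta_i^*$ is the $p \times (p - 1)$ matrix formed by deleting the $i$th column from $\beta$. Then
$h^{(i)} = (\Lambda_i^*)^{-1/2} \beta_i^{*\prime} \sqrt{n}\,(b^{(i)} - \beta^{(i)})$ has a limiting normal distribution with mean 0
and covariance matrix

(7)    $\mathcal{AC}(h^{(i)}) = I_{p-1},$

and

(8)    $h^{(i)\prime} h^{(i)} = n\,(b^{(i)} - \beta^{(i)})' \beta_i^* (\Lambda_i^*)^{-1} \beta_i^{*\prime} (b^{(i)} - \beta^{(i)})$

has a limiting $\chi^2$-distribution with $p - 1$ degrees of freedom. The matrix of
the quadratic form in $\sqrt{n}\,(b^{(i)} - \beta^{(i)})$ is

(9)    $\beta_i^* (\Lambda_i^*)^{-1} \beta_i^{*\prime} = \beta\left(\lambda_i \Lambda^{-1} - 2I + \lambda_i^{-1}\Lambda\right)\beta' - \beta^{(i)}\left(\lambda_i \lambda_i^{-1} - 2 + \lambda_i^{-1}\lambda_i\right)\beta^{(i)\prime}$

       $= \lambda_i \Sigma^{-1} - 2I + (1/\lambda_i)\Sigma,$

because $\beta \Lambda^{-1} \beta' = \Sigma^{-1}$, $\beta\beta' = I$, and $\beta \Lambda \beta' = \Sigma$. Then (8) is

(10)    $n\,(b^{(i)} - \beta^{(i)})' \left[\lambda_i \Sigma^{-1} - 2I + (1/\lambda_i)\Sigma\right] (b^{(i)} - \beta^{(i)})$

        $= n\, b^{(i)\prime} \left[\lambda_i \Sigma^{-1} - 2I + (1/\lambda_i)\Sigma\right] b^{(i)}$

        $= n \left[\lambda_i\, b^{(i)\prime} \Sigma^{-1} b^{(i)} + (1/\lambda_i)\, b^{(i)\prime} \Sigma b^{(i)} - 2\right],$

because $\beta^{(i)}$ is a characteristic vector of $\Sigma$ with root $\lambda_i$, and of $\Sigma^{-1}$ with
root $1/\lambda_i$. On the left-hand side of (10) we can replace $\Sigma$ and $\lambda_i$ by the
consistent estimators $S$ and $l_i$ to obtain

(11)    $n\,(b^{(i)} - \beta^{(i)})' \left[l_i S^{-1} - 2I + (1/l_i) S\right] (b^{(i)} - \beta^{(i)})$

        $= n \left[l_i\, \beta^{(i)\prime} S^{-1} \beta^{(i)} + (1/l_i)\, \beta^{(i)\prime} S \beta^{(i)} - 2\right],$

which has a limiting $\chi^2$-distribution with $p - 1$ degrees of freedom.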

A confidence region for the $i$th characteristic vector of $\Sigma$ with confidence
$1 - \varepsilon$ consists of the intersection of $\beta^{(i)\prime}\beta^{(i)} = 1$ and the set of $\beta^{(i)}$ such that
the right-hand side of (11) is less than $\chi^2_{p-1}(\varepsilon)$, where $\Pr\{\chi^2_{p-1} > \chi^2_{p-1}(\varepsilon)\} = \varepsilon$.

Note that the matrix of the quadratic form (9) is positive semidefinite.
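The right-hand side of (11) is easy to evaluate for a candidate vector. The sketch below assumes NumPy; the function name and the zero-based index `i` are conventions chosen here for illustration. The critical value $\chi^2_{p-1}(\varepsilon)$ is left to the user to look up, since no table or routine for it is given in this section.

```python
import numpy as np


def vector_region_statistic(S, n, i, beta):
    """Right-hand side of (11): n [ l_i b'S^{-1}b + (1/l_i) b'Sb - 2 ] for a candidate b."""
    S = np.asarray(S, dtype=float)
    beta = np.asarray(beta, dtype=float)
    beta = beta / np.linalg.norm(beta)           # enforce beta'beta = 1
    roots = np.linalg.eigvalsh(S)[::-1]          # l_1 >= ... >= l_p
    l_i = roots[i]                               # i = 0 is the largest root
    quad_inv = beta @ np.linalg.solve(S, beta)   # beta' S^{-1} beta
    quad = beta @ S @ beta                       # beta' S beta
    return n * (l_i * quad_inv + quad / l_i - 2.0)
```

A candidate vector lies in the confidence region when the returned value is below $\chi^2_{p-1}(\varepsilon)$; at the sample vector $b^{(i)}$ itself the statistic is zero.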
This approach also provides a test of the null hypothesis that the $i$th
characteristic vector is a specified $\beta_0^{(i)}$ ($\beta_0^{(i)\prime}\beta_0^{(i)} = 1$). The hypothesis is
rejected if the right-hand side of (11) with $\beta^{(i)}$ replaced by $\beta_0^{(i)}$ exceeds
$\chi^2_{p-1}(\varepsilon)$.

Mallows (1961) suggested a test of whether some characteristic vector of
$\Sigma$ is $\beta_0$. Let $\bar\beta_0$ be a $p \times (p - 1)$ matrix such that $\beta_0'\bar\beta_0 = 0$. If the null
hypothesis is true, $\beta_0'X$ and $\bar\beta_0'X$ are independent (because $\bar\beta_0$ is a nonsingular
transform of the set of other characteristic vectors). The test is based on
the multiple correlation between $\beta_0'X$ and $\bar\beta_0'X$. In principle, the test
procedure can be inverted to obtain a confidence region. The usefulness of
these procedures is limited by the fact that the hypothesized vector is not
attached to a characteristic root; the interpretation depends on the root (e.g.,
largest versus smallest).

Tyler (1981), (1983b) has generalized the confidence region (11) to include
the vectors in a linear subspace. He has also studied easing the restrictions of
a normally distributed parent population.


11.6.3. Exact Confidence Limits on the Characteristic Roots 

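We now consider a confidence interval for the entire set of characteristic
roots of $\Sigma$, namely, $\lambda_1 \ge \cdots \ge \lambda_p$ [Anderson (1965a)]. We use the facts that
$\beta^{(i)\prime}\Sigma\beta^{(i)} = \lambda_i \beta^{(i)\prime}\beta^{(i)} = \lambda_i$, $i = 1, p$, and $\beta^{(1)\prime}\Sigma\beta^{(p)} = 0 = \beta^{(1)\prime}\beta^{(p)}$. Then
$\beta^{(1)\prime}X$ and $\beta^{(p)\prime}X$ are uncorrelated and have variances $\lambda_1$ and $\lambda_p$, respectively.
Hence $n\beta^{(1)\prime}S\beta^{(1)}/\lambda_1$ and $n\beta^{(p)\prime}S\beta^{(p)}/\lambda_p$ are independently
distributed as $\chi^2$ with $n$ degrees of freedom. Let $l$ and $u$ be two numbers such
that

(12)    $1 - \varepsilon = \Pr\{nl \le \chi_n^2\}\, \Pr\{\chi_n^2 \le nu\}.$

Then

(13)    $1 - \varepsilon = \Pr\left\{ l \le \dfrac{\beta^{(1)\prime}S\beta^{(1)}}{\lambda_1},\ \dfrac{\beta^{(p)\prime}S\beta^{(p)}}{\lambda_p} \le u \right\}$

        $= \Pr\left\{ \dfrac{\beta^{(p)\prime}S\beta^{(p)}}{u} \le \lambda_p,\ \lambda_1 \le \dfrac{\beta^{(1)\prime}S\beta^{(1)}}{l} \right\}$

        $\le \Pr\left\{ \min_{b'b=1} \dfrac{b'Sb}{u} \le \lambda_p,\ \lambda_1 \le \max_{b'b=1} \dfrac{b'Sb}{l} \right\}$

        $= \Pr\left\{ \dfrac{l_p}{u} \le \lambda_p,\ \lambda_1 \le \dfrac{l_1}{l} \right\}.$

Theorem 11.6.1. A confidence interval for the characteristic roots of $\Sigma$
with confidence at least $1 - \varepsilon$ is

(14)    $\dfrac{l_p}{u} \le \lambda_p \le \lambda_1 \le \dfrac{l_1}{l},$

where $l$ and $u$ satisfy (12).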

A tighter inequality can lead to a better lower bound. The matrix $H =
n\beta' S \beta$ has characteristic roots $nl_1, \ldots, nl_p$ because $\beta$ is orthogonal. We use
the following lemma.

Lemma 11.6.1. For any positive definite matrix $H$

(15)    $\mathrm{ch}_p(H) \le \dfrac{1}{h^{ii}} \le \mathrm{ch}_1(H), \qquad i = 1, \ldots, p,$

where $H^{-1} = (h^{ij})$ and $\mathrm{ch}_p(H)$ and $\mathrm{ch}_1(H)$ are the minimum and maximum
characteristic roots of $H$, respectively.

Proof. From Theorem A.2.4 in the Appendix we have $\mathrm{ch}_p(H) \le h_{ii} \le
\mathrm{ch}_1(H)$ and

(16)    $\mathrm{ch}_p(H^{-1}) \le h^{ii} \le \mathrm{ch}_1(H^{-1}), \qquad i = 1, \ldots, p.$

Since $\mathrm{ch}_p(H) = 1/\mathrm{ch}_1(H^{-1})$ and $\mathrm{ch}_1(H) = 1/\mathrm{ch}_p(H^{-1})$, the lemma follows.

■
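A quick numerical check of the lemma can be reassuring. The following sketch assumes NumPy; the function name and tolerance are illustrative choices, not part of the text.

```python
import numpy as np


def lemma_bounds_hold(H, tol=1e-10):
    """Check ch_p(H) <= 1/h^{ii} <= ch_1(H) for every i, where H^{-1} = (h^{ij})."""
    H = np.asarray(H, dtype=float)
    roots = np.linalg.eigvalsh(H)                  # ascending order
    ch_p, ch_1 = roots[0], roots[-1]
    inv_diag = 1.0 / np.diag(np.linalg.inv(H))     # the quantities 1 / h^{ii}
    return bool(np.all(ch_p - tol <= inv_diag) and np.all(inv_diag <= ch_1 + tol))
```

For a diagonal $H$ the quantities $1/h^{ii}$ are the diagonal elements themselves, so both bounds are attained.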

The argument for Theorem 5.2.2 shows that $1/(\lambda_p h^{pp})$ is distributed as
$\chi^2$ with $n - p + 1$ degrees of freedom, and Theorem 4.3.3 shows that $h^{pp}$ is
independent of $h_{11}$. Let $l'$ and $u'$ be two numbers such that

(17)    $1 - \varepsilon = \Pr\{nl' \le \chi_n^2\}\, \Pr\{\chi_{n-p+1}^2 \le nu'\}.$

Then

(18)    $1 - \varepsilon = \Pr\left\{ nl' \le \dfrac{h_{11}}{\lambda_1},\ \dfrac{1}{\lambda_p h^{pp}} \le nu' \right\}$

        $\le \Pr\left\{ \dfrac{l_p}{u'} \le \lambda_p,\ \lambda_1 \le \dfrac{l_1}{l'} \right\},$

since $\mathrm{ch}_p(H) = nl_p$ and $\mathrm{ch}_1(H) = nl_1$.

Theorem 11.6.2. A confidence interval for the characteristic roots of $\Sigma$
with confidence at least $1 - \varepsilon$ is

(19)    $\dfrac{l_p}{u'} \le \lambda_p \le \lambda_1 \le \dfrac{l_1}{l'},$

where $l'$ and $u'$ satisfy (17).

Anderson (1965a, 1965b) showed that the above confidence bounds are
optimal within the class of bounds

(20)    $f(l_1, \ldots, l_p) \le \lambda_p \le \lambda_1 \le g(l_1, \ldots, l_p),$

where $f$ and $g$ are homogeneous of degree 1 and are monotonically nondecreasing
in each argument for fixed values of the others. If (20) holds with
probability at least $1 - \varepsilon$, then a pair of numbers $u'$ and $l'$ can be found to
satisfy (17) and

(21)    $f(l_1, \ldots, l_p) \le \dfrac{l_p}{u'}, \qquad \dfrac{l_1}{l'} \le g(l_1, \ldots, l_p).$

The homogeneity condition means that the confidence bounds are multiplied
by $c^2$ if the observed vectors are multiplied by $c$ (which is a kind of scale
invariance). The monotonicity conditions imply that an increase in the size of
$S$ results in an increase in the limits for $\Sigma$ (which is a kind of consistency).

The confidence bounds given in (31) of Section 10.8 for the roots of $\Sigma$
based on the distribution of the roots of $S$ when $\Sigma = I$ are greater.


11.7. TESTING HYPOTHESES ABOUT THE CHARACTERISTIC
ROOTS OF A COVARIANCE MATRIX 

11.7.1. Testing a Hypothesis about the Sum of the Smallest 
Characteristic Roots 

An investigator may raise the question whether the last $p - m$ principal
components may be ignored, that is, whether the first $m$ principal components
furnish a good approximation to $X$. He may want to do this if the sum
of the variances of the last principal components is less than some specified
amount, say $\gamma$. Consider the null hypothesis

(1)    $H\colon\ \lambda_{m+1} + \cdots + \lambda_p \ge \gamma,$

where $\gamma$ is specified, against the alternative that the sum is less than $\gamma$. If the
characteristic roots of $\Sigma$ are different, it follows from Theorem 13.5.1 that

(2)    $\sqrt{n} \left( \sum_{i=m+1}^{p} l_i - \sum_{i=m+1}^{p} \lambda_i \right)$

has a limiting normal distribution with mean 0 and variance $2\sum_{i=m+1}^{p} \lambda_i^2$. The
variance can be consistently estimated by $2\sum_{i=m+1}^{p} l_i^2$. Then a rejection region
with (large-sample) significance level $\varepsilon$ is

(3)    $\sum_{i=m+1}^{p} l_i < \gamma - \dfrac{\sqrt{2\sum_{i=m+1}^{p} l_i^2}}{\sqrt{n}}\, z(2\varepsilon),$

where $z(2\varepsilon)$ is the upper significance point of the standard normal distribution
for significance level $\varepsilon$. The (large-sample) probability of rejection is $\varepsilon$ if
equality holds in (1) and is less than $\varepsilon$ if inequality holds.

The investigator may alternatively want an upper confidence interval for
$\sum_{i=m+1}^{p} \lambda_i$ with at least approximate confidence level $1 - \varepsilon$. It is

(4)    $\sum_{i=m+1}^{p} \lambda_i \le \sum_{i=m+1}^{p} l_i + \dfrac{\sqrt{2\sum_{i=m+1}^{p} l_i^2}}{\sqrt{n}}\, z(2\varepsilon).$

If the right-hand side is sufficiently small (in particular less than $\gamma$), the
investigator has confidence that the sum of the variances of the smallest
$p - m$ principal components is so small they can be neglected. Anderson
(1963a) gave this analysis also in the case that $\lambda_{m+1} = \cdots = \lambda_p$.
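The rejection rule (3) and the upper bound (4) can be sketched in a few lines. The function name, argument names, and return layout below are assumptions made for illustration; only the standard library is used.

```python
from math import sqrt
from statistics import NormalDist


def smallest_roots_sum_test(roots, m, n, gamma, eps=0.05):
    """Return (sum of last p-m sample roots, upper bound (4), reject H per (3))."""
    l = sorted(roots, reverse=True)            # l_1 >= ... >= l_p
    tail = l[m:]                               # l_{m+1}, ..., l_p
    z = NormalDist().inv_cdf(1.0 - eps)        # z(2 eps): upper-tail probability eps
    se = sqrt(2.0 * sum(x * x for x in tail) / n)
    total = sum(tail)
    upper = total + se * z                     # right-hand side of (4)
    reject = total < gamma - se * z            # rejection region (3)
    return total, upper, reject
```

By construction the hypothesis is rejected exactly when the upper bound (4) falls below $\gamma$, which is the duality between (3) and (4) described in the text.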


11.7.2. Testing a Hypothesis about the Sum of the Smallest
Characteristic Roots Relative to the Sum of All the Roots 


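The investigator may want to ignore the last $p - m$ principal components if
their sum is small relative to the sum of all the roots (which is the trace of the
covariance matrix). Consider the null hypothesis

(5)    $H\colon\ f(\lambda) = \dfrac{\lambda_{m+1} + \cdots + \lambda_p}{\lambda_1 + \cdots + \lambda_p} \ge \delta,$

where $\delta$ is specified, against the alternative that $f(\lambda) < \delta$. We use the fact
that

(6)    $\dfrac{\partial f(\lambda)}{\partial \lambda_i} = -\dfrac{\lambda_{m+1} + \cdots + \lambda_p}{(\lambda_1 + \cdots + \lambda_p)^2}, \qquad i = 1, \ldots, m,$

       $\dfrac{\partial f(\lambda)}{\partial \lambda_i} = \dfrac{\lambda_1 + \cdots + \lambda_m}{(\lambda_1 + \cdots + \lambda_p)^2}, \qquad i = m + 1, \ldots, p.$

Then the asymptotic variance of $f(l)$ is

(7)    $2 \left( \dfrac{\delta}{\operatorname{tr} \Sigma} \right)^2 (\lambda_1^2 + \cdots + \lambda_m^2) + 2 \left( \dfrac{1 - \delta}{\operatorname{tr} \Sigma} \right)^2 (\lambda_{m+1}^2 + \cdots + \lambda_p^2)$

when equality holds in (5), by Theorem 4.2.3. The null hypothesis $H$ is
rejected if $\sqrt{n}\,[f(l) - \delta]$ is less than the appropriate significance point of the
standard normal distribution times the square root of (7) with $\lambda$'s replaced by
$l$'s and $\operatorname{tr} \Sigma$ by $\operatorname{tr} S$. Alternatively one can construct a large-sample confidence
region for $f(\lambda)$. A confidence region of approximate confidence $1 - \varepsilon$ is
[$z = z(2\varepsilon)$]

(8)    $\dfrac{\sum_{i=m+1}^{p} \lambda_i}{\sum_{i=1}^{p} \lambda_i} \le \dfrac{\sum_{i=m+1}^{p} l_i}{\operatorname{tr} S} + \dfrac{z}{\sqrt{n}} \left[ 2 \left( \dfrac{\sum_{i=m+1}^{p} l_i}{(\operatorname{tr} S)^2} \right)^2 (l_1^2 + \cdots + l_m^2) + 2 \left( \dfrac{\sum_{i=1}^{m} l_i}{(\operatorname{tr} S)^2} \right)^2 (l_{m+1}^2 + \cdots + l_p^2) \right]^{1/2}.$

If the right-hand side is sufficiently small, the investigator may be willing to
let the first $m$ principal components represent the entire vector of measurements.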
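The upper bound (8) follows from (6) by the delta method, and a minimal sketch of it is given below. The function name and arguments are illustrative assumptions; the variance is the plug-in version with sample roots in place of population roots.

```python
from math import sqrt
from statistics import NormalDist


def tail_fraction_upper_bound(roots, m, n, eps=0.05):
    """Return (f(l), large-sample upper confidence bound for f(lambda))."""
    l = sorted(roots, reverse=True)
    head, tail = l[:m], l[m:]
    tr = sum(l)                                      # tr S = sum of the sample roots
    f = sum(tail) / tr
    # Squared partial derivatives from (6), each weighted by the variance 2 l_i^2.
    var = (2.0 * (sum(tail) / tr ** 2) ** 2 * sum(x * x for x in head)
           + 2.0 * (sum(head) / tr ** 2) ** 2 * sum(x * x for x in tail))
    z = NormalDist().inv_cdf(1.0 - eps)              # z(2 eps)
    return f, f + z * sqrt(var / n)
```

For example, with roots 6, 3, 0.6, 0.4, $m = 2$, and $n = 100$, the last two components carry a fraction 0.1 of the total variance and the bound is a little above 0.12.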


11.7.3. Testing Equality of the Smallest Roots

Suppose the observed $X$ is given by $V + U + \mu$, where $V$ and $U$ are unobservable
random vectors with means 0 and $\mu$ is an unobservable vector of
constants. If $\mathcal{E}UU' = \sigma^2 I$, then $U$ can be interpreted as composed of errors
of measurement: uncorrelated components with equal variances. (It is
assumed that all components of $X$ are in the same units.) Then $V$ can be
interpreted as made up of the systematic parts and is supposed to lie in an
$m$-dimensional space. Then $\mathcal{E}VV' = \Phi$ is positive semidefinite of rank $m$.
The observable covariance matrix $\Sigma = \Phi + \sigma^2 I$ has a characteristic root of
$\sigma^2$ with multiplicity $p - m$ (Problem 11.4).

In this subsection we consider testing the null hypothesis that $\lambda_{m+1} = \cdots
= \lambda_p$. That is equivalent to the null hypothesis that $\Sigma = \Phi + \sigma^2 I$, where $\Phi$ is
positive semidefinite of rank $m$. In Section 10.7, we saw that when $m = 0$, the
likelihood ratio criterion was the $\tfrac{1}{2}pN$th power of the ratio of the geometric
mean to the arithmetic mean of the sample roots. The analogous criterion
here is the $\tfrac{1}{2}N$th power of

(9)    $\dfrac{\prod_{i=m+1}^{p} l_i}{\left( \sum_{i=m+1}^{p} l_i / (p - m) \right)^{p-m}}.$

It is also the likelihood ratio criterion, but we shall not derive it. [See
Anderson (1963a).] Let $\sqrt{n}\,(l_i - \lambda_{m+1}) = d_i$, $i = m + 1, \ldots, p$. The logarithm
of (9) multiplied by $-n$ is asymptotically equivalent under the null hypothesis
to

(10)    $-n \sum_{i=m+1}^{p} \log l_i + n(p - m) \log \dfrac{\sum_{i=m+1}^{p} l_i}{p - m}$

        $= -n \sum_{i=m+1}^{p} \log\left(\lambda_{m+1} + n^{-1/2} d_i\right) + n(p - m) \log \dfrac{\sum_{i=m+1}^{p} \left(\lambda_{m+1} + n^{-1/2} d_i\right)}{p - m}$

        $= n \left\{ -\sum_{i=m+1}^{p} \log\left(1 + \dfrac{d_i}{\lambda_{m+1} n^{1/2}}\right) + (p - m) \log\left(1 + \dfrac{\sum_{i=m+1}^{p} d_i}{(p - m)\lambda_{m+1} n^{1/2}}\right) \right\}$

        $= n \left\{ -\sum_{i=m+1}^{p} \left[ \dfrac{d_i}{\lambda_{m+1} n^{1/2}} - \dfrac{d_i^2}{2\lambda_{m+1}^2 n} + \cdots \right] + (p - m) \left[ \dfrac{\sum_{i=m+1}^{p} d_i}{(p - m)\lambda_{m+1} n^{1/2}} - \dfrac{\left(\sum_{i=m+1}^{p} d_i\right)^2}{2(p - m)^2 \lambda_{m+1}^2 n} + \cdots \right] \right\}$

        $= \dfrac{1}{2\lambda_{m+1}^2} \left[ \sum_{i=m+1}^{p} d_i^2 - \dfrac{1}{p - m} \left( \sum_{i=m+1}^{p} d_i \right)^2 \right] + o_p(1).$

It is shown in Section 13.5.2 that the limiting distribution of $d_{m+1}, \ldots, d_p$ is
the same as the distribution of the roots of a symmetric matrix $U_{22} = (u_{ij})$,
$i, j = m + 1, \ldots, p$, whose functionally independent elements are independent
and normal with mean 0; an off-diagonal element $u_{ij}$, $i < j$, has variance
$\lambda_{m+1}^2$, and a diagonal element $u_{ii}$ has variance $2\lambda_{m+1}^2$. See Theorem 13.5.2.
Then (10) has the limiting distribution of

(11)    $\dfrac{1}{2\lambda_{m+1}^2} \left[ \operatorname{tr} U_{22}^2 - \dfrac{1}{p - m} (\operatorname{tr} U_{22})^2 \right]$

        $= \dfrac{1}{2\lambda_{m+1}^2} \left[ 2\sum_{i<j} u_{ij}^2 + \sum_{i=m+1}^{p} u_{ii}^2 - \dfrac{1}{p - m} \left( \sum_{i=m+1}^{p} u_{ii} \right)^2 \right].$

Thus $\sum_{i<j} u_{ij}^2 / \lambda_{m+1}^2$ is asymptotically $\chi^2$ with $\tfrac{1}{2}(p - m)(p - m - 1)$ degrees
of freedom; $\left[ \sum_{i=m+1}^{p} u_{ii}^2 - \left( \sum_{i=m+1}^{p} u_{ii} \right)^2 / (p - m) \right] / (2\lambda_{m+1}^2)$ is asymptotically $\chi^2$
with $p - m - 1$ degrees of freedom. Then (10) has a limiting $\chi^2$-distribution
with $\tfrac{1}{2}(p - m + 2)(p - m - 1)$ degrees of freedom. The hypothesis is rejected
if the left-hand side of (10) is greater than the upper-tailed significance point
of the $\chi^2$-distribution. If the hypothesis is not rejected, the investigator may
consider the last $p - m$ principal components to be composed entirely of
error.
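The left-hand side of (10) and its degrees of freedom can be computed directly from the sample roots. In this sketch the function name and arguments are assumptions for illustration; the chi-square critical value is again left to the user.

```python
from math import log


def equal_smallest_roots_statistic(roots, m, n):
    """Return (left-hand side of (10), degrees of freedom (q + 2)(q - 1)/2) with q = p - m."""
    l = sorted(roots, reverse=True)
    tail = l[m:]                                   # l_{m+1}, ..., l_p
    q = len(tail)
    # -n * sum(log l_i) + n * q * log(arithmetic mean of the last q roots)
    stat = -n * sum(log(x) for x in tail) + n * q * log(sum(tail) / q)
    df = (q + 2) * (q - 1) // 2
    return stat, df
```

The statistic is nonnegative because the arithmetic mean is at least the geometric mean, and it vanishes exactly when the last $p - m$ sample roots are all equal.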

When the units of measurement are not all the same, the three hypotheses
considered in Section 11.7 have questionable meaning. Corresponding
hypotheses for the correlation matrix also have doubtful interpretation.
Moreover, the last criterion does not have (usually) a $\chi^2$-distribution. More
discussion is given by Anderson (1963a).

The criterion (9) corresponds to the sphericity criterion of Section 10.7,
and the number of degrees of freedom of the corresponding $\chi^2$-distribution
is $\tfrac{1}{2}(p - m)(p - m + 1) - 1$.

11.8. ELLIPTICALLY CONTOURED DISTRIBUTIONS


11.8.1. Observations Elliptically Contoured

Let $x_1, \ldots, x_N$ be $N$ observations on a random vector $X$ with density

(1)    $|\Psi|^{-1/2}\, g[(x - \nu)' \Psi^{-1} (x - \nu)],$

where $\Psi$ is a positive definite matrix, $R^2 = (x - \nu)' \Psi^{-1} (x - \nu)$, and
$\mathcal{E}R^4 < \infty$. Define $\kappa = p\,\mathcal{E}R^4 / [(\mathcal{E}R^2)^2 (p + 2)] - 1$. Then $\mathcal{E}X = \nu = \mu$ and
$\mathcal{E}(X - \nu)(X - \nu)' = (\mathcal{E}R^2/p)\Psi = \Sigma$.

The maximum likelihood estimators of the principal components of $\Sigma$ are
the characteristic roots and vectors of $\hat\Sigma = (\mathcal{E}R^2/p)\hat\Psi$ given by (20) of
Section 3.6. Alternative estimators are the characteristic roots and vectors of
$S$, the unbiased estimator of $\Sigma$. The asymptotic normal distributions of these
estimators are derived in Section 13.7 (Theorem 13.7.1). Let $\Sigma = \beta\Lambda\beta'$ and
$S = BLB'$, where $\Lambda$ and $L$ are diagonal and $\beta$ and $B$ are orthogonal. Let
$D = \sqrt{N}\,(L - \Lambda)$ and $G = \sqrt{N}\,(B - \beta)$. Then the limiting distribution of $D$
and $G$ is normal with $G$ and $D$ independent.

The variance of $d_i$ is $(2 + 3\kappa)\lambda_i^2$, and the covariance of $d_i$ and $d_j$ ($i \ne j$) is
$\kappa\lambda_i\lambda_j$. The covariance of $g_i$ is

(2)    $\mathcal{AC}(g_i) = (1 + \kappa) \sum_{k \ne i} \dfrac{\lambda_i \lambda_k}{(\lambda_i - \lambda_k)^2}\, \beta_k \beta_k'.$

The covariance of $g_i$ and $g_j$ is

(3)    $\mathcal{AC}(g_i, g_j) = -(1 + \kappa) \dfrac{\lambda_i \lambda_j}{(\lambda_i - \lambda_j)^2}\, \beta_j \beta_i', \qquad i \ne j.$

For inference about a single ordered root $\lambda_i$ the limiting standard normal
distribution of $\sqrt{N}\,(l_i - \lambda_i) / (\sqrt{2 + 3\kappa}\; l_i)$ can be used.

For inference about a single vector the right-hand side of (11) in Section
11.6.2 can be used with $S$ replaced by $(1 + \kappa)S$ and $S^{-1}$ by $S^{-1}/(1 + \kappa)$.

It is shown in Section 13.7.1 that the limiting distribution of the logarithm
of the likelihood ratio criterion for testing the equality of the $q = p - m$
smallest roots is the distribution of $(1 + \kappa)\chi^2_{q(q+1)/2 - 1}$.
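As a small illustration of the effect of $\kappa$ on inference about a single root, the sketch below inverts the standard normal pivot quoted above. The function name is an assumption made here, and $\kappa$ is treated as known; in practice it would have to be estimated, which this sketch does not attempt.

```python
from math import sqrt
from statistics import NormalDist


def root_interval_elliptical(l_i, N, kappa, eps=0.05):
    """Invert sqrt(N) (l_i - lambda_i) / (sqrt(2 + 3 kappa) l_i) ~ N(0, 1) approximately."""
    z = NormalDist().inv_cdf(1.0 - eps / 2.0)    # two-sided point z(eps)
    h = z * sqrt((2.0 + 3.0 * kappa) / N)
    return l_i * (1.0 - h), l_i * (1.0 + h)
```

With $\kappa = 0$ (the normal case) the half-width is $z(\varepsilon)\sqrt{2/N}\, l_i$; a positive $\kappa$ widens the interval by the factor $\sqrt{(2 + 3\kappa)/2}$.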


11.8.2. Elliptically Contoured Matrix Distributions

Suppose the density of $X = (x_1, \ldots, x_N)'$ is

$|\Psi|^{-N/2}\, g[\operatorname{tr} A\Psi^{-1} + N(\bar{x} - \nu)' \Psi^{-1} (\bar{x} - \nu)],$

where $A = (X - \varepsilon_N \bar{x}')'(X - \varepsilon_N \bar{x}') = nS$ and $n = N - 1$. Thus $\bar{x}$ and $A$ are a
sufficient set of statistics.

Now consider $A = YY'$ having the density $g(\operatorname{tr} A)$. Let $A = BLB'$, where $L$
is diagonal with diagonal elements $l_1 > \cdots > l_p$ and $B$ is orthogonal with
$b_{1i} \ge 0$. Then $L$ and $B$ are independent; the roots have the density
(18) of Section 13.7, and the matrix $B$ has the conditional Haar invariant
distribution.


PROBLEMS 


11.1. (Sec. 11.2) Prove that the characteristic vectors of

$\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$

are

$\begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}$ and $\begin{pmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{pmatrix},$

corresponding to roots $1 + \rho$ and $1 - \rho$.

11.2. (Sec. 11.2) Verify that the proof of Theorem 11.2.1 yields a proof of Theorem
A.2.1 of the Appendix for any real symmetric matrix.

11.3. (Sec. 11.2) Let $z = y + x$, where $\mathcal{E}y = \mathcal{E}x = 0$, $\mathcal{E}yy' = \Phi$, $\mathcal{E}xx' = \sigma^2 I$, $\mathcal{E}yx' = 0$.
The $p$ components of $y$ can be called systematic parts, and the components
of $x$ errors.

(a) Find the linear combination $\gamma'z$ of unit variance that has minimum error
variance (i.e., $\gamma'x$ has minimum variance).

(b) Suppose $\phi_{ii} + \sigma^2 = 1$, $i = 1, \ldots, p$. Find the linear function $\gamma'z$ of unit
variance that maximizes the sum of squares of the correlations between $z_i$
and $\gamma'z$, $i = 1, \ldots, p$.

(c) Relate these results to principal components.

11.4. (Sec. 11.2) Let $\Sigma = \Phi + \sigma^2 I$, where $\Phi$ is positive semidefinite of rank $m$.
Prove that each characteristic vector of $\Phi$ is a vector of $\Sigma$ and each root of $\Sigma$
is a root of $\Phi$ plus $\sigma^2$.

11.5. (Sec. 11.2) Let the characteristic roots of $\Sigma$ be $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$.

(a) What is the form of $\Sigma$ if $\lambda_1 = \lambda_2 = \cdots = \lambda_p > 0$? What is the shape of an
ellipsoid of constant density?

(b) What is the form of $\Sigma$ if $\lambda_1 > \lambda_2 = \cdots = \lambda_p > 0$? What is the shape of an
ellipsoid of constant density?

(c) What is the form of $\Sigma$ if $\lambda_1 = \cdots = \lambda_{p-1} > \lambda_p > 0$? What is the shape of
the ellipsoid of constant density?

11.6. (Sec. 11.2) Intraclass correlation. Let

$\Sigma = \sigma^2 [(1 - \rho) I + \rho\, \varepsilon\varepsilon'],$

where $\varepsilon = (1, \ldots, 1)'$. Show that for $\rho > 0$, the largest characteristic root is
$\sigma^2 [1 + (p - 1)\rho]$ and the corresponding characteristic vector is $\varepsilon$. Show that if
$\varepsilon'x = 0$, then $x$ is a characteristic vector corresponding to the root $\sigma^2 (1 - \rho)$.
Show that the root $\sigma^2 (1 - \rho)$ has multiplicity $p - 1$.

11.7. (Sec. 11.3) In the example of Section 9.6, consider the three pressing operations
$(x_2, x_4, x_5)$. Find the first principal component of this estimated covariance
matrix. [Hint: Start with the vector $(1, 1, 1)$ and iterate.]

11.8. (Sec. 11.3) Prove directly the sample analog of Theorem 11.2.1, where
$\sum_\alpha x_\alpha = 0$, $\sum_\alpha x_\alpha x_\alpha' = A$.

11.9. (Sec. 11.3) Let $l_1$ and $l_p$ be the largest and smallest characteristic roots of $S$,
respectively. Prove $\mathcal{E}l_1 \ge \lambda_1$ and $\mathcal{E}l_p \le \lambda_p$.

11.10. (Sec. 11.3) Let $U_1 = \beta^{(1)\prime}X$ be the first population principal component with
variance $\mathcal{V}(U_1) = \lambda_1$, and let $V_1 = b^{(1)\prime}X$ be the first sample principal component
with (sample) variance $l_1$ (based on $S$). Let $S^*$ be the covariance matrix
of a second (independent) sample. Show $\mathcal{E}\, b^{(1)\prime} S^* b^{(1)} \le \lambda_1$.
11.11. (Sec. 11.3) Suppose that $\sigma_{ij} > 0$ for every $i, j$ [$\Sigma = (\sigma_{ij})$]. Show that (a) the
coefficients of the first principal component are all of the same sign, and
(b) the coefficients of each other principal component cannot be all of the
same sign.

11.12. (Sec. 11.4) Prove (4) when $\lambda_1 > \lambda_2$.

(a) Show $\Sigma^i = \beta\Lambda^i\beta'$.

(b) Show 

。 A 1 , p (士 八) p 、、• 

where = and s 广 \/^/x[ i] x U] . 

(c) Show 



where $E_{11}$ has 1 in the upper left-hand position and 0's elsewhere.

(d) Show—= 1/( 卩山 » 2 , 

(e) Conclude the proof.


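11.13. (Sec. 11.4) Let

$\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{(1)}' \\ \sigma_{(1)} & \Sigma_{22} \end{pmatrix}, \qquad K = \begin{pmatrix} 1 & 0 \\ 0 & H \end{pmatrix},$

where $H = I - 2\alpha\alpha'$ and $\alpha$ has $p - 1$ components. Show that $\alpha$ can be
chosen so that in

$K \Sigma K' = \begin{pmatrix} \sigma_{11} & \sigma_{(1)}' H \\ H\sigma_{(1)} & H \Sigma_{22} H \end{pmatrix}$

$H\sigma_{(1)}$ has all 0 components except the first.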

11.14. (Sec. 11.6) Show that

$\log l_i - \sqrt{2/n}\, z(\varepsilon) \le \log \lambda_i \le \log l_i + \sqrt{2/n}\, z(\varepsilon)$

is a confidence interval for $\log \lambda_i$ with approximate confidence $1 - \varepsilon$.

11.15. (Sec. 11.6) Prove that $u' < u$ if $l' = l$ and $p > 2$.

11.16. (Sec. 11.6) Prove that $u < u^*$ if $l = l^*$ and $p > 2$, where $l^*$ and $u^*$ are the $l$
and $u$ of Section 10.8.4.





11.17. The lengths, widths, and heights (in millimeters) of 24 male painted turtles
[Jolicoeur and Mosimann (1960)] are given below. Find the (sample) principal
components and their variances.

Case                               Case
No.   Length   Width   Height      No.   Length   Width   Height

 1      93       74      37        13     116       90      43
 2      94       78      35        14     117       90      41
 3      96       80      35        15     117       91      41
 4     101       84      39        16     119       93      41
 5     102       85      38        17     120       89      40
 6     103       81      37        18     120       93      44
 7     104       83      39        19     121       95      42
 8     106       83      39        20     125       93      45
 9     107       82      38        21     127       96      45
10     112       89      40        22     128       95      45
11     113       88      40        23     131       95      46
12     114       86      40        24     135      106      47





CHAPTER 12 


Canonical Correlations 
and Canonical Variables 


12.1. INTRODUCTION 


In this section we consider two sets of variates with a joint distribution, and 
we analyze the correlations between the variables of one set and those of the 
other set. We find a new coordinate system in the space of each set of
variates in such a way that the new coordinates display unambiguously the 
system of correlation. More precisely, we find linear combinations of vari¬ 
ables in the sets that have maximum correlation; these linear combinations 
are the first coordinates in the new systems. Then a second linear combina¬ 
tion in each set is sought such that the correlation between these is the 
maximum of correlations between such linear combinations as are uncorre¬ 
lated with the first linear combinations. The procedure is continued until the 
two new coordinate systems are completely specified. 

The statistical method outlined is of particular usefulness in exploratory 
studies. The investigator may have two large sets of variates and may want 
to study the interrelations. If the two sets are very large, he may want
to consider only a few linear combinations of each set. Then he will want to 
study those linear combinations most highly correlated. For example, one set
of variables may be measurements of physical characteristics, such as various 
lengths and breadths of skulls; the other variables may be measurements of 
mental characteristics, such as scores on intelligence tests. If the investigator 
is interested in relating these, he may find that the interrelation is almost
completely described by the correlation between the first few canonical
variates.

The basic theory was developed by Hotelling (1935), (1936).

In Section 12.2 the canonical correlations and variates in the population 
are defined; they imply a linear transformation to canonical form. Maximum 
likelihood estimators are sample analogs. Tests of independence and of the 
rank of a correlation matrix are developed on the basis of asymptotic theory 
in Section 12.4. 

Another formulation of canonical correlations and variates is made in the 
case of one set being random and the other set consisting of nonstochastic 
variables; the expected values of the random variables are linear combinations
of the nonstochastic variables (Section 12.6). This is the model of
Section 8.2. One set of canonical variables consists of linear combinations of
the random variables and the other set consists of the nonstochastic variables;
the effect of the regression of a member of the first set on a member of
the second is maximized. Linear functional relationships are studied in this 
framework. 

Simultaneous equations models are studied in Section 12.7. Estimation of
a single equation in this model is formally identical to estimation of a single
linear functional relationship. The limited-information maximum likelihood 
estimator and the two-stage least squares estimator are developed. 


12.2. CANONICAL CORRELATIONS AND VARIATES 
IN THE POPULATION 


Suppose the random vector $X$ of $p$ components has the covariance matrix $\Sigma$
(which is assumed to be positive definite). Since we are only interested in
variances and covariances in this chapter, we shall assume $\mathcal{E}X = 0$ when
treating the population. In developing the concepts and algebra we do not
need to assume that $X$ is normally distributed, though this latter assumption
will be made to develop sampling theory.

We partition $X$ into two subvectors of $p_1$ and $p_2$ components, respectively,

(1)    $X = \begin{pmatrix} X^{(1)} \\ X^{(2)} \end{pmatrix}.$

For convenience we shall assume $p_1 \le p_2$. The covariance matrix is partitioned
similarly into $p_1$ and $p_2$ rows and columns,

(2)    $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$

In the previous chapter we developed a rotation of coordinate axes to a new
system in which the variance properties were clearly exhibited. Here we shall
develop a transformation of the first $p_1$ coordinate axes and a transformation
of the last $p_2$ coordinate axes to a new $(p_1 + p_2)$-system that will exhibit
clearly the intercorrelations between $X^{(1)}$ and $X^{(2)}$.

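Consider an arbitrary linear combination, $U = \alpha'X^{(1)}$, of the components
of $X^{(1)}$, and an arbitrary linear function, $V = \gamma'X^{(2)}$, of the components of
$X^{(2)}$. We first ask for the linear functions that have maximum correlation.
Since the correlation of a multiple of $U$ and a multiple of $V$ is the same as
the correlation of $U$ and $V$, we can make an arbitrary normalization of $\alpha$ and
$\gamma$. We therefore require $\alpha$ and $\gamma$ to be such that $U$ and $V$ have unit
variance, that is,

(3)    $1 = \mathcal{E}U^2 = \mathcal{E}\alpha'X^{(1)}X^{(1)\prime}\alpha = \alpha'\Sigma_{11}\alpha,$

(4)    $1 = \mathcal{E}V^2 = \mathcal{E}\gamma'X^{(2)}X^{(2)\prime}\gamma = \gamma'\Sigma_{22}\gamma.$

We note that $\mathcal{E}U = \mathcal{E}\alpha'X^{(1)} = \alpha'\mathcal{E}X^{(1)} = 0$ and similarly $\mathcal{E}V = 0$. Then the
correlation between $U$ and $V$ is

(5)    $\mathcal{E}UV = \mathcal{E}\alpha'X^{(1)}X^{(2)\prime}\gamma = \alpha'\Sigma_{12}\gamma.$

Thus the algebraic problem is to find $\alpha$ and $\gamma$ to maximize (5) subject to (3)
and (4).

Let

(6)    $\psi = \alpha'\Sigma_{12}\gamma - \tfrac{1}{2}\lambda(\alpha'\Sigma_{11}\alpha - 1) - \tfrac{1}{2}\mu(\gamma'\Sigma_{22}\gamma - 1),$

where $\lambda$ and $\mu$ are Lagrange multipliers. We differentiate $\psi$ with respect to
the elements of $\alpha$ and $\gamma$. The vectors of derivatives set equal to zero are

(7)    $\dfrac{\partial\psi}{\partial\alpha} = \Sigma_{12}\gamma - \lambda\Sigma_{11}\alpha = 0,$

(8)    $\dfrac{\partial\psi}{\partial\gamma} = \Sigma_{12}'\alpha - \mu\Sigma_{22}\gamma = 0.$

Multiplication of (7) on the left by $\alpha'$ and (8) on the left by $\gamma'$ gives

(9)    $\alpha'\Sigma_{12}\gamma - \lambda\alpha'\Sigma_{11}\alpha = 0,$

(10)    $\gamma'\Sigma_{12}'\alpha - \mu\gamma'\Sigma_{22}\gamma = 0.$

Since $\alpha'\Sigma_{11}\alpha = 1$ and $\gamma'\Sigma_{22}\gamma = 1$, this shows that $\lambda = \mu = \alpha'\Sigma_{12}\gamma$. Thus (7)
and (8) can be written as

(11)    $-\lambda\Sigma_{11}\alpha + \Sigma_{12}\gamma = 0,$

(12)    $\Sigma_{21}\alpha - \lambda\Sigma_{22}\gamma = 0,$

since $\Sigma_{12}' = \Sigma_{21}$. In one matrix equation this is

(13)    $\begin{pmatrix} -\lambda\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & -\lambda\Sigma_{22} \end{pmatrix} \begin{pmatrix} \alpha \\ \gamma \end{pmatrix} = 0.$

In order that there be a nontrivial solution [which is necessary for a solution
satisfying (3) and (4)], the matrix on the left must be singular; that is,

(14)    $\begin{vmatrix} -\lambda\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & -\lambda\Sigma_{22} \end{vmatrix} = 0.$

The determinant on the left is a polynomial of degree $p$. To demonstrate
this, consider a Laplace expansion by minors of the first $p_1$ columns. One
term is $|-\lambda\Sigma_{11}| \cdot |-\lambda\Sigma_{22}| = (-\lambda)^{p_1 + p_2} |\Sigma_{11}| \cdot |\Sigma_{22}|$. The other terms in the
expansion are of lower degree in $\lambda$ because one or more rows of each minor
in the first $p_1$ columns does not contain $\lambda$. Since $\Sigma$ is positive definite,
$|\Sigma_{11}| \cdot |\Sigma_{22}| \ne 0$ (Corollary A.1.3 of the Appendix). This shows that (14) is a
polynomial equation of degree $p$ and has $p$ roots, say $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$. [$\alpha'$
and $\gamma'$ complex conjugate in (9) and (10) prove $\lambda$ real.]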
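As a quick check of (14), consider the smallest case $p_1 = p_2 = 1$. The scalars $\sigma_{11}$, $\sigma_{12}$, $\sigma_{22}$ and the correlation $\rho$ are introduced here only for illustration.

```latex
\begin{aligned}
\begin{vmatrix} -\lambda\sigma_{11} & \sigma_{12} \\ \sigma_{12} & -\lambda\sigma_{22} \end{vmatrix}
  &= \lambda^{2}\sigma_{11}\sigma_{22} - \sigma_{12}^{2} = 0, \\
\lambda &= \pm\frac{\sigma_{12}}{\sqrt{\sigma_{11}\sigma_{22}}} = \pm\rho .
\end{aligned}
```

The polynomial has degree $p = 2$, its two roots are real and of opposite sign, and the larger root $|\rho|$ is the ordinary correlation in absolute value, attained with $\alpha = 1/\sqrt{\sigma_{11}}$ and $\gamma = \pm 1/\sqrt{\sigma_{22}}$.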

From (9) we see that $\lambda = \alpha'\Sigma_{12}\gamma$ is the correlation between $U = \alpha'X^{(1)}$
and $V = \gamma'X^{(2)}$ when $\alpha$ and $\gamma$ satisfy (13) for some value of $\lambda$. Since we
want the maximum correlation, we take $\lambda = \lambda_1$. Let a solution to (13) for
$\lambda = \lambda_1$ be $\alpha^{(1)}$, $\gamma^{(1)}$, and let $U_1 = \alpha^{(1)\prime}X^{(1)}$ and $V_1 = \gamma^{(1)\prime}X^{(2)}$. Then $U_1$ and $V_1$
are normalized linear combinations of $X^{(1)}$ and $X^{(2)}$, respectively, with maximum
correlation.

We now consider finding a second linear combination of $X^{(1)}$, say $U =
\alpha'X^{(1)}$, and a second linear combination of $X^{(2)}$, say $V = \gamma'X^{(2)}$, such that of
all linear combinations uncorrelated with $U_1$ and $V_1$ these have maximum
correlation. This procedure is continued. At the $r$th step we have obtained
linear combinations $U_1 = \alpha^{(1)\prime}X^{(1)}$, $V_1 = \gamma^{(1)\prime}X^{(2)}$, $\ldots$, $U_r = \alpha^{(r)\prime}X^{(1)}$, $V_r = \gamma^{(r)\prime}X^{(2)}$
with corresponding correlations [roots of (14)] $\lambda^{(1)} = \lambda_1, \lambda^{(2)}, \ldots, \lambda^{(r)}$.
We ask for a linear combination of $X^{(1)}$, $U = \alpha'X^{(1)}$, and a linear combination
of $X^{(2)}$, $V = \gamma'X^{(2)}$, that among all linear combinations uncorrelated with
$U_1, V_1, \ldots, U_r, V_r$, have maximum correlation. The condition that $U$ be uncorrelated
with $U_i$ is

(15)    $0 = \mathcal{E}UU_i = \mathcal{E}\alpha'X^{(1)}X^{(1)\prime}\alpha^{(i)} = \alpha'\Sigma_{11}\alpha^{(i)}.$

Then

(16)    $\mathcal{E}UV_i = \alpha'\Sigma_{12}\gamma^{(i)} = \lambda^{(i)}\alpha'\Sigma_{11}\alpha^{(i)} = 0.$

The condition that $V$ be uncorrelated with $V_i$ is

(17)    $0 = \mathcal{E}VV_i = \gamma'\Sigma_{22}\gamma^{(i)}.$

By the same argument we have

(18)    $\mathcal{E}VU_i = \gamma'\Sigma_{21}\alpha^{(i)} = \lambda^{(i)}\gamma'\Sigma_{22}\gamma^{(i)} = 0.$

We now maximize $\mathcal{E}U_{r+1}V_{r+1}$, choosing $\alpha$ and $\gamma$ to satisfy (3), (4), (15),
and (17) for $i = 1, 2, \ldots, r$. Consider

(19)    $\psi_{r+1} = \alpha'\Sigma_{12}\gamma - \tfrac{1}{2}\lambda(\alpha'\Sigma_{11}\alpha - 1) - \tfrac{1}{2}\mu(\gamma'\Sigma_{22}\gamma - 1) + \sum_{i=1}^{r} \nu_i \alpha'\Sigma_{11}\alpha^{(i)} + \sum_{i=1}^{r} \theta_i \gamma'\Sigma_{22}\gamma^{(i)},$

where $\lambda, \mu, \nu_1, \ldots, \nu_r, \theta_1, \ldots, \theta_r$ are Lagrange multipliers. The vectors of
partial derivatives of $\psi_{r+1}$ with respect to the elements of $\alpha$ and $\gamma$ are set
equal to zero, giving

(20)    $\dfrac{\partial\psi_{r+1}}{\partial\alpha} = \Sigma_{12}\gamma - \lambda\Sigma_{11}\alpha + \sum_{i=1}^{r} \nu_i \Sigma_{11}\alpha^{(i)} = 0,$

(21)    $\dfrac{\partial\psi_{r+1}}{\partial\gamma} = \Sigma_{21}\alpha - \mu\Sigma_{22}\gamma + \sum_{i=1}^{r} \theta_i \Sigma_{22}\gamma^{(i)} = 0.$

Multiplication of (20) on the left by $\alpha^{(j)\prime}$ and (21) on the left by $\gamma^{(j)\prime}$ gives

(22)    $0 = \nu_j \alpha^{(j)\prime}\Sigma_{11}\alpha^{(j)} = \nu_j,$

(23)    $0 = \theta_j \gamma^{(j)\prime}\Sigma_{22}\gamma^{(j)} = \theta_j.$

Thus (20) and (21) are simply (11) and (12) or alternatively (13). We therefore
take the largest $\lambda_i$, say, $\lambda^{(r+1)}$, such that there is a solution to (13) satisfying
(3), (4), (15), and (17) for $i = 1, \ldots, r$. Let this solution be $\alpha^{(r+1)}$, $\gamma^{(r+1)}$, and
let $U_{r+1} = \alpha^{(r+1)\prime}X^{(1)}$ and $V_{r+1} = \gamma^{(r+1)\prime}X^{(2)}$.

This procedure is continued step by step as long as successive solutions
can be found which satisfy the conditions, namely, (13) for some $\lambda_i$, (3), (4),
(15), and (17). Let $m$ be the number of steps for which this can be done. Now
we shall show that $m = p_1$ $(\le p_2)$. Let $A = (\alpha^{(1)} \cdots \alpha^{(m)})$, $\Gamma_1 = (\gamma^{(1)} \cdots \gamma^{(m)})$, and

(24)    $\Lambda = \begin{pmatrix} \lambda^{(1)} & 0 & \cdots & 0 \\ 0 & \lambda^{(2)} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \lambda^{(m)} \end{pmatrix}.$

The conditions (3) and (15) can be summarized as

(25)    $A'\Sigma_{11}A = I.$

Since $\Sigma_{11}$ is of rank $p_1$ and $I$ is of rank $m$, we have $m \le p_1$. Now let us show
that $m < p_1$ leads to a contradiction by showing that in this case there is
another vector satisfying the conditions. Since $A'\Sigma_{11}$ is $m \times p_1$, there exists a
$p_1 \times (p_1 - m)$ matrix $E$ (of rank $p_1 - m$) such that $A'\Sigma_{11}E = 0$. Similarly
there is a $p_2 \times (p_2 - m)$ matrix $F$ (of rank $p_2 - m$) such that $\Gamma_1'\Sigma_{22}F = 0$.
We also have $\Gamma_1'\Sigma_{21}E = \Lambda A'\Sigma_{11}E = 0$ and $A'\Sigma_{12}F = \Lambda\Gamma_1'\Sigma_{22}F = 0$. Since $E$
is of rank $p_1 - m$, $E'\Sigma_{11}E$ is nonsingular (if $m < p_1$), and similarly $F'\Sigma_{22}F$
is nonsingular. Thus there is at least one root of


(26)    $\begin{vmatrix} -\nu E'\Sigma_{11}E & E'\Sigma_{12}F \\ F'\Sigma_{21}E & -\nu F'\Sigma_{22}F \end{vmatrix} = 0,$

because $|E'\Sigma_{11}E| \cdot |F'\Sigma_{22}F| \ne 0$. From the preceding algebra we see that
there exist vectors $a$ and $b$ such that

(27)    $E'\Sigma_{12}Fb = \nu E'\Sigma_{11}Ea,$

(28)    $F'\Sigma_{21}Ea = \nu F'\Sigma_{22}Fb.$

Let $Ea = g$ and $Fb = h$. We now want to show that $\nu$, $g$, and $h$ form a
new solution $\lambda^{(m+1)}$, $\alpha^{(m+1)}$, $\gamma^{(m+1)}$. Let $\Sigma_{11}^{-1}\Sigma_{12}h = k$. Since $A'\Sigma_{11}k =
A'\Sigma_{12}Fb = 0$, $k$ is orthogonal to the rows of $A'\Sigma_{11}$ and therefore is a linear
combination of the columns of $E$, say $Ec$. Thus the equation $\Sigma_{12}h = \Sigma_{11}k$
can be written

(29)    $\Sigma_{12}Fb = \Sigma_{11}Ec.$

Multiplication by $E'$ on the left gives

(30)    $E'\Sigma_{12}Fb = E'\Sigma_{11}Ec.$

Since $E'\Sigma_{11}E$ is nonsingular, comparison of (27) and (30) shows that $c = \nu a$,
and therefore $k = \nu g$. Thus

(31)    $\Sigma_{12}h = \nu\Sigma_{11}g.$

In a similar fashion we show that

(32)    $\Sigma_{21}g = \nu\Sigma_{22}h.$

Therefore $\nu = \lambda^{(m+1)}$, $g = \alpha^{(m+1)}$, $h = \gamma^{(m+1)}$ is another solution. But this is
contrary to the assumption that $\lambda^{(m)}$, $\alpha^{(m)}$, $\gamma^{(m)}$ was the last possible solution.
Thus $m = p_1$.

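The conditions on the $\lambda$'s, $\alpha$'s, and $\gamma$'s can be summarized as

(33)    $A'\Sigma_{11}A = I,$

(34)    $A'\Sigma_{12}\Gamma_1 = \Lambda,$

(35)    $\Gamma_1'\Sigma_{22}\Gamma_1 = I.$

Let $\Gamma_2 = (\gamma^{(p_1+1)} \cdots \gamma^{(p_2)})$ be a $p_2 \times (p_2 - p_1)$ matrix satisfying

(36)    $\Gamma_2'\Sigma_{22}\Gamma_1 = 0,$

(37)    $\Gamma_2'\Sigma_{22}\Gamma_2 = I.$

Any $\Gamma_2$ can be multiplied on the right by an arbitrary $(p_2 - p_1) \times (p_2 - p_1)$
orthogonal matrix. This matrix can be formed one column at a time: $\gamma^{(p_1+1)}$
is a vector orthogonal to $\Sigma_{22}\Gamma_1$ and normalized so $\gamma^{(p_1+1)\prime}\Sigma_{22}\gamma^{(p_1+1)} = 1$;
$\gamma^{(p_1+2)}$ is a vector orthogonal to $\Sigma_{22}(\Gamma_1\ \gamma^{(p_1+1)})$ and normalized so
$\gamma^{(p_1+2)\prime}\Sigma_{22}\gamma^{(p_1+2)} = 1$; and so forth. Let $\Gamma = (\Gamma_1\ \Gamma_2)$; this square matrix is
nonsingular since $\Gamma'\Sigma_{22}\Gamma = I$. Consider the determinant

(38)    $\begin{vmatrix} A' & 0 \\ 0 & \Gamma_1' \\ 0 & \Gamma_2' \end{vmatrix} \cdot \begin{vmatrix} -\lambda\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & -\lambda\Sigma_{22} \end{vmatrix} \cdot \begin{vmatrix} A & 0 & 0 \\ 0 & \Gamma_1 & \Gamma_2 \end{vmatrix} = \begin{vmatrix} -\lambda I & \Lambda & 0 \\ \Lambda & -\lambda I & 0 \\ 0 & 0 & -\lambda I \end{vmatrix}$

        $= (-\lambda)^{p_2 - p_1} \begin{vmatrix} -\lambda I & \Lambda \\ \Lambda & -\lambda I \end{vmatrix} = (-\lambda)^{p_2 - p_1}\, |\lambda^2 I - \Lambda^2| = (-\lambda)^{p_2 - p_1} \prod_{i=1}^{p_1} \left(\lambda^2 - \lambda^{(i)2}\right).$

Except for a constant factor the above polynomial is

(39)    $\begin{vmatrix} -\lambda\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & -\lambda\Sigma_{22} \end{vmatrix}.$

Thus the roots of (14) are the roots of (38) set equal to zero, namely,
$\lambda = \pm\lambda^{(i)}$, $i = 1, \ldots, p_1$, and $\lambda = 0$ (of multiplicity $p_2 - p_1$). Thus $(\lambda_1, \ldots, \lambda_p)
= (\lambda_1, \ldots, \lambda_{p_1}, 0, \ldots, 0, -\lambda_{p_1}, \ldots, -\lambda_1)$. The set $\{\lambda^{(i)2}\}$, $i = 1, \ldots, p_1$, is the set
$\{\lambda_i^2\}$, $i = 1, \ldots, p_1$. To show that the set $\{\lambda^{(i)}\}$, $i = 1, \ldots, p_1$, is the set $\{\lambda_i\}$,
$i = 1, \ldots, p_1$, we only need to show that $\lambda^{(i)}$ is nonnegative (and therefore is
one of the $\lambda_i$, $i = 1, \ldots, p_1$). We observe that

(40)    $\Sigma_{12}\gamma^{(r)} = -\lambda^{(r)}\Sigma_{11}(-\alpha^{(r)}),$

(41)    $\Sigma_{21}(-\alpha^{(r)}) = -\lambda^{(r)}\Sigma_{22}\gamma^{(r)};$

thus, if $\lambda^{(r)}$, $\alpha^{(r)}$, $\gamma^{(r)}$ is a solution, so is $-\lambda^{(r)}$, $-\alpha^{(r)}$, $\gamma^{(r)}$. If $\lambda^{(r)}$ were
negative, then $-\lambda^{(r)}$ would be nonnegative and $-\lambda^{(r)} \ge \lambda^{(r)}$. But since $\lambda^{(r)}$
was to be maximum, we must have $\lambda^{(r)} \ge -\lambda^{(r)}$ and therefore $\lambda^{(r)} \ge 0$. Since
the set $\{\lambda^{(i)}\}$ is the same as $\{\lambda_i\}$, $i = 1, \ldots, p_1$, we must have $\lambda^{(i)} = \lambda_i$.
Let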

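(42)    $U = \begin{pmatrix} U_1 \\ \vdots \\ U_{p_1} \end{pmatrix} = A'X^{(1)},$

(43)    $V^{(1)} = \begin{pmatrix} V_1 \\ \vdots \\ V_{p_1} \end{pmatrix} = \Gamma_1'X^{(2)},$

(44)    $V^{(2)} = \begin{pmatrix} V_{p_1+1} \\ \vdots \\ V_{p_2} \end{pmatrix} = \Gamma_2'X^{(2)}.$

The components of $U$ are one set of canonical variates, and the components
of $V = (V^{(1)\prime}\ V^{(2)\prime})'$ are the other set.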
Definition 12.2.1. Let $X = (X^{(1)\prime}\ X^{(2)\prime})'$, where $X^{(1)}$ has $p_1$ components
and $X^{(2)}$ has $p_2$ $(= p - p_1 \ge p_1)$ components. The $r$th pair of canonical variates
is the pair of linear combinations $U_r = \alpha^{(r)\prime}X^{(1)}$ and $V_r = \gamma^{(r)\prime}X^{(2)}$, each of unit
variance and uncorrelated with the first $r - 1$ pairs of canonical variates and
having maximum correlation. The correlation is the $r$th canonical correlation.


Theorem 12.2.1. Let $X = (X^{(1)\prime}\ X^{(2)\prime})'$ be a random vector with covariance
matrix $\Sigma$. The $r$th canonical correlation between $X^{(1)}$ and $X^{(2)}$ is the $r$th largest
root of (14). The coefficients of $\alpha^{(r)\prime}X^{(1)}$ and $\gamma^{(r)\prime}X^{(2)}$ defining the $r$th pair of
canonical variates satisfy (13) for $\lambda = \lambda_r$ and (3) and (4).


We can now verify (without differentiation) that $U_1$, $V_1$ have maximum
correlation. The linear combinations $a'U = (a'A')X^{(1)}$ and $b'V = (b'\Gamma')X^{(2)}$
are normalized by $a'a = 1$ and $b'b = 1$. Since $A$ and $\Gamma$ are nonsingular, any
vector $\alpha$ can be written as $Aa$ and any vector $\gamma$ can be written as $\Gamma b$, and
hence any linear combinations $\alpha'X^{(1)}$ and $\gamma'X^{(2)}$ can be written as $a'U$ and
$b'V$. The correlation between them is

(47)    $a'(\Lambda\ \ 0)\,b = \sum_{i=1}^{p_1} \lambda_i a_i b_i.$

Let $\lambda_i a_i / \sqrt{\sum_j (\lambda_j a_j)^2} = c_i$. Then the maximum of
$a'(\Lambda\ \ 0)\,b = \sqrt{\sum_i (\lambda_i a_i)^2}\ \sum_i c_i b_i$
with respect to $b$ is for $b_i = c_i$, since $\sum_i c_i b_i$ is the cosine of the angle between
the vector $b$ and $(c_1, \ldots, c_{p_1}, 0, \ldots, 0)'$. Then (47) is

$\sqrt{\sum_{i=1}^{p_1} \lambda_i^2 a_i^2} = \sqrt{\sum_{i=2}^{p_1} (\lambda_i^2 - \lambda_1^2)\, a_i^2 + \lambda_1^2},$

and this is maximized by taking $a_i = 0$, $i = 2, \ldots, p_1$. Thus the maximized
linear combinations are $U_1$ and $V_1$. In verifying that $U_2$ and $V_2$ form the
second pair of canonical variates we note that lack of correlation between
$U_1$ and a linear combination $a'U$ means $0 = \mathcal{E}U_1 a'U = \mathcal{E}U_1 \sum_i a_i U_i = a_1$ and lack
of correlation between $V_1$ and $b'V$ means $0 = b_1$. The algebra used above
gives the desired result with sums starting with $i = 2$.

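We can derive a single matrix equation for $\alpha$ or $\gamma$. If we multiply (11) by $\lambda$ and (12) by $\Sigma_{22}^{-1}$, we have

(48)  $\lambda\Sigma_{12}\gamma = \lambda^2\Sigma_{11}\alpha,$

(49)  $\Sigma_{22}^{-1}\Sigma_{21}\alpha = \lambda\gamma.$

Substitution from (49) into (48) gives

(50)  $\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\alpha = \lambda^2\Sigma_{11}\alpha$

or

(51)  $(\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} - \lambda^2\Sigma_{11})\alpha = 0.$

The quantities $\lambda_1^2, \ldots, \lambda_{p_1}^2$ satisfy

(52)  $|\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} - \nu\Sigma_{11}| = 0,$

and $\alpha^{(1)}, \ldots, \alpha^{(p_1)}$ satisfy (51) for $\lambda^2 = \lambda_1^2, \ldots, \lambda_{p_1}^2$, respectively. The similar equations for $\gamma^{(1)}, \ldots, \gamma^{(p_2)}$ occur when $\lambda^2 = \lambda_1^2, \ldots, \lambda_{p_2}^2$ are substituted with

(53)  $(\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} - \lambda^2\Sigma_{22})\gamma = 0.$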
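
Equations (51) and (52) form a generalized characteristic-root problem, which can be sketched numerically as follows. This is a minimal illustration in Python with NumPy and is not part of the original text; the function name `canonical_correlations` and the small covariance matrix (identity diagonal blocks, chosen so that the canonical correlations can be read off as 0.6 and 0.3) are assumptions made for the example.

```python
import numpy as np


def canonical_correlations(sigma, p1):
    """Return (lambda, A) from equations (51)-(52) for a partitioned covariance matrix.

    sigma is the p x p covariance matrix; its first p1 rows and columns belong to X^(1).
    """
    s11, s12 = sigma[:p1, :p1], sigma[:p1, p1:]
    s21, s22 = sigma[p1:, :p1], sigma[p1:, p1:]
    # Sigma11^{-1} Sigma12 Sigma22^{-1} Sigma21 has the roots nu = lambda^2 of (52).
    m = np.linalg.solve(s11, s12 @ np.linalg.solve(s22, s21))
    nu, vecs = np.linalg.eig(m)
    nu, vecs = nu.real, vecs.real
    order = np.argsort(nu)[::-1]               # largest root first
    nu, vecs = nu[order], vecs[:, order]
    # Rescale each column so that alpha' Sigma11 alpha = 1, as in (3).
    scale = np.sqrt(np.einsum("ji,jk,ki->i", vecs, s11, vecs))
    return np.sqrt(np.clip(nu, 0.0, None)), vecs / scale


# Hypothetical example: identity blocks, cross-covariances 0.6 and 0.3.
sigma = np.eye(5)
sigma[0, 2] = sigma[2, 0] = 0.6
sigma[1, 3] = sigma[3, 1] = 0.3
lam, A = canonical_correlations(sigma, p1=2)
print(lam)
```

The vectors $\gamma^{(i)}$ could then be obtained from (49) for each nonzero root; that step is omitted here to keep the sketch short.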

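Theorem 12.2.2. The canonical correlations are invariant with respect to transformations $X^{(i)*} = C_iX^{(i)}$, where $C_i$ is nonsingular, $i = 1, 2$, and any function of $\Sigma$ that is invariant is a function of the canonical correlations.

Proof. Equation (14) is transformed to

(54)  $0 = \begin{vmatrix} -\lambda C_1\Sigma_{11}C_1' & C_1\Sigma_{12}C_2' \\ C_2\Sigma_{21}C_1' & -\lambda C_2\Sigma_{22}C_2' \end{vmatrix} = \left|\begin{pmatrix} C_1 & 0 \\ 0 & C_2 \end{pmatrix}\begin{pmatrix} -\lambda\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & -\lambda\Sigma_{22} \end{pmatrix}\begin{pmatrix} C_1' & 0 \\ 0 & C_2' \end{pmatrix}\right|,$

and hence the roots are unchanged. Conversely, let $f(\Sigma_{11}, \Sigma_{12}, \Sigma_{22})$ be a vector-valued function of $\Sigma$ such that $f(C_1\Sigma_{11}C_1', C_1\Sigma_{12}C_2', C_2\Sigma_{22}C_2') = f(\Sigma_{11}, \Sigma_{12}, \Sigma_{22})$ for all nonsingular $C_1$ and $C_2$. If $C_1 = A'$ and $C_2 = \Gamma'$, then (54) is (38), which depends only on the canonical correlations. Then $f = f[I, (\Lambda, 0), I]$. ∎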

We can make another interpretation of these developments in terms of prediction. Consider two random variables $U$ and $V$ with means 0 and variances $\sigma_u^2$ and $\sigma_v^2$ and correlation $\rho$. Consider approximating $U$ by a multiple of $V$, say $bV$; then the mean squared error of approximation is

(55)  $\mathcal{E}(U - bV)^2 = \sigma_u^2 - 2b\sigma_u\sigma_v\rho + b^2\sigma_v^2 = \sigma_u^2(1 - \rho^2) + (b\sigma_v - \rho\sigma_u)^2.$

This is minimized by taking $b = \sigma_u\rho/\sigma_v$. We can consider $bV$ as a linear prediction of $U$ from $V$; then $\sigma_u^2(1 - \rho^2)$ is the mean squared error of prediction. The ratio of the mean squared error of prediction to the variance of $U$ is $\sigma_u^2(1 - \rho^2)/\sigma_u^2 = 1 - \rho^2$; the complement is a measure of the relative effect of $V$ on $U$ or the relative effectiveness of $V$ in predicting $U$. Thus the greater $\rho^2$ or $|\rho|$ is, the more effective is $V$ in predicting $U$.

Now consider the random vector $X$ partitioned according to (1), and consider using a linear combination $V = \gamma'X^{(2)}$ to predict a linear combination $U = \alpha'X^{(1)}$. Then $V$ predicts $U$ best if the correlation between $U$ and $V$ is a maximum. Thus we can say that $\alpha^{(1)\prime}X^{(1)}$ is the linear combination of $X^{(1)}$ that can be predicted best, and $\gamma^{(1)\prime}X^{(2)}$ is the best predictor [Hotelling (1935)].

The mean squared effect of $V$ on $U$ can be measured as

(56)  $\mathcal{E}(bV)^2 = \rho^2\frac{\sigma_u^2}{\sigma_v^2}\mathcal{E}V^2 = \rho^2\sigma_u^2,$

and the relative mean squared effect can be measured by the ratio $\mathcal{E}(bV)^2/\mathcal{E}U^2 = \rho^2$. Thus maximum effect of a linear combination of $X^{(2)}$ on a linear combination of $X^{(1)}$ is made by $\gamma^{(1)\prime}X^{(2)}$ on $\alpha^{(1)\prime}X^{(1)}$.

In the special case of $p_1 = 1$, the one canonical correlation is the multiple correlation between $X^{(1)} = X_1$ and $X^{(2)}$.

The definition of canonical variates and correlations was made in terms of the covariance matrix $\Sigma = \mathcal{E}(X - \mathcal{E}X)(X - \mathcal{E}X)'$. We could extend this treatment by starting with a normally distributed vector $Y$ with $p + p_3$ components and define $X$ as the vector having the conditional distribution of the first $p$ components of $Y$ given the value of the last $p_3$ components. This would mean treating $X_\phi$ with mean $\mathcal{E}X_\phi = \Theta y_\phi^{(3)}$; the elements of the covariance matrix would be the partial covariances of the first $p$ elements of $Y$.

The interpretation of canonical variates may be facilitated by considering the correlations between the canonical variates and the components of the original vectors [e.g., Darlington, Weinberg, and Wahlberg (1973)]. The covariance between the jth canonical variate $U_j$ and $X_i$ is

(57)  $\mathcal{E}U_jX_i = \mathcal{E}\sum_{k=1}^{p_1}\alpha_k^{(j)}X_kX_i = \sum_{k=1}^{p_1}\alpha_k^{(j)}\sigma_{ki}.$

Since the variance of $U_j$ is 1, the correlation between $U_j$ and $X_i$ is

(58)  $\operatorname{Corr}(U_j, X_i) = \frac{\sum_{k=1}^{p_1}\alpha_k^{(j)}\sigma_{ki}}{\sqrt{\sigma_{ii}}}.$

An advantage of this measure is that it does not depend on the units of measurement of $X_i$. However, it is not a scalar multiple of the weight of $X_i$ in $U_j$ (namely, $\alpha_i^{(j)}$).

A special case is $\Sigma_{11} = I$, $\Sigma_{22} = I$. Then

(59)  $A'A = I, \quad \Gamma'\Gamma = I, \quad A'\Sigma_{12}\Gamma = (\Lambda \;\; 0).$

From these we obtain

(60)  $\Sigma_{12} = A(\Lambda \;\; 0)\Gamma',$

where $A$ and $\Gamma$ are orthogonal and $\Lambda$ is diagonal. This relationship is known as the singular value decomposition of $\Sigma_{12}$. The elements of $\Lambda$ are the square roots of the characteristic roots of $\Sigma_{12}\Sigma_{12}'$, and the columns of $A$ are characteristic vectors. The diagonal elements of $\Lambda$ are square roots of the (possibly nonzero) roots of $\Sigma_{12}'\Sigma_{12}$, and the columns of $\Gamma$ are the characteristic vectors.


12.3. ESTIMATION OF CANONICAL CORRELATIONS 
AND VARIATES 


12.3.1. Estimation 

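Let $x_1, \ldots, x_N$ be $N$ observations from $N(\mu, \Sigma)$. Let $x_\alpha$ be partitioned into two subvectors of $p_1$ and $p_2$ components, respectively,

(1)  $x_\alpha = \begin{pmatrix} x_\alpha^{(1)} \\ x_\alpha^{(2)} \end{pmatrix}, \quad \alpha = 1, \ldots, N.$

The maximum likelihood estimator of $\Sigma$ [partitioned as in (2) of Section 12.2] is

(2)  $\hat\Sigma = \begin{pmatrix} \hat\Sigma_{11} & \hat\Sigma_{12} \\ \hat\Sigma_{21} & \hat\Sigma_{22} \end{pmatrix} = \frac{1}{N}\sum_{\alpha=1}^{N}(x_\alpha - \bar x)(x_\alpha - \bar x)' = \frac{1}{N}\begin{pmatrix} \sum_\alpha (x_\alpha^{(1)} - \bar x^{(1)})(x_\alpha^{(1)} - \bar x^{(1)})' & \sum_\alpha (x_\alpha^{(1)} - \bar x^{(1)})(x_\alpha^{(2)} - \bar x^{(2)})' \\ \sum_\alpha (x_\alpha^{(2)} - \bar x^{(2)})(x_\alpha^{(1)} - \bar x^{(1)})' & \sum_\alpha (x_\alpha^{(2)} - \bar x^{(2)})(x_\alpha^{(2)} - \bar x^{(2)})' \end{pmatrix}.$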

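The maximum likelihood estimators of the canonical correlations $\Lambda$ and the canonical variates defined by $A$ and $\Gamma$ involve applying the algebra of the previous section to $\hat\Sigma$. The matrices $\Lambda$, $A$, and $\Gamma_1$ are uniquely defined if we assume the canonical correlations different and that the first nonzero element of each column of $A$ is positive. The indeterminacy in $\Gamma_2$ allows multiplication on the right by a $(p_2 - p_1) \times (p_2 - p_1)$ orthogonal matrix; this indeterminacy can be removed by various types of requirements, for example, that the submatrix formed by the lower $p_2 - p_1$ rows be upper or lower triangular with positive diagonal elements. Application of Corollary 3.2.1 then shows that the maximum likelihood estimators of $\lambda_1, \ldots, \lambda_{p_1}$ are the roots of

(3)  $\begin{vmatrix} -\hat\lambda\hat\Sigma_{11} & \hat\Sigma_{12} \\ \hat\Sigma_{21} & -\hat\lambda\hat\Sigma_{22} \end{vmatrix} = 0,$

and the jth columns of $\hat A$ and $\hat\Gamma_1$ satisfy

(4)  $\begin{pmatrix} -\hat\lambda_j\hat\Sigma_{11} & \hat\Sigma_{12} \\ \hat\Sigma_{21} & -\hat\lambda_j\hat\Sigma_{22} \end{pmatrix}\begin{pmatrix} \hat\alpha^{(j)} \\ \hat\gamma^{(j)} \end{pmatrix} = 0,$

(5)  $\hat\alpha^{(j)\prime}\hat\Sigma_{11}\hat\alpha^{(j)} = 1, \quad \hat\gamma^{(j)\prime}\hat\Sigma_{22}\hat\gamma^{(j)} = 1.$

$\hat\Gamma_2$ satisfies

(6)  $\hat\Gamma_2'\hat\Sigma_{22}\hat\Gamma_1 = 0,$

(7)  $\hat\Gamma_2'\hat\Sigma_{22}\hat\Gamma_2 = I.$

When the other restrictions on $\hat\Gamma_2$ are made, $\hat A$, $\hat\Gamma$, and $\hat\Lambda$ are uniquely defined.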
Theorem 12.3.1. Let $x_1, \ldots, x_N$ be $N$ observations from $N(\mu, \Sigma)$. Let $\Sigma$ be partitioned into $p_1$ and $p_2$ $(p_1 \le p_2)$ rows and columns as in (2) in Section 12.2, and let $x_\alpha$ be similarly partitioned as in (1). The maximum likelihood estimators of the canonical correlations are the roots of (3), where $\hat\Sigma_{ij}$ are defined by (2). The maximum likelihood estimators of the coefficients of the jth canonical components satisfy (4) and (5), $j = 1, \ldots, p_1$; the remaining components satisfy (6) and (7).


In the population the canonical correlations and canonical variates were found in terms of maximizing correlations of linear combinations of two sets of variates. The entire argument can be carried out in terms of the sample. Thus $\hat\alpha^{(1)\prime}x_\alpha^{(1)}$ and $\hat\gamma^{(1)\prime}x_\alpha^{(2)}$ have maximum sample correlation between any linear combinations of $x_\alpha^{(1)}$ and $x_\alpha^{(2)}$, and this correlation is $l_1$. Similarly, $\hat\alpha^{(2)\prime}x_\alpha^{(1)}$ and $\hat\gamma^{(2)\prime}x_\alpha^{(2)}$ have the second maximum sample correlation, and so forth.

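It may also be observed that we could define the sample canonical variates and correlations in terms of $S$, the unbiased estimator of $\Sigma$. Then $a^{(j)} = \sqrt{(N-1)/N}\,\hat\alpha^{(j)}$, $c^{(j)} = \sqrt{(N-1)/N}\,\hat\gamma^{(j)}$, and $l_j$ satisfy

(8)  $S_{12}c^{(j)} = l_jS_{11}a^{(j)},$

(9)  $S_{21}a^{(j)} = l_jS_{22}c^{(j)},$

(10)  $a^{(j)\prime}S_{11}a^{(j)} = 1, \quad c^{(j)\prime}S_{22}c^{(j)} = 1.$

We shall call the linear combinations $a^{(j)\prime}x_\alpha^{(1)}$ and $c^{(j)\prime}x_\alpha^{(2)}$ the sample canonical variates.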

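We can also derive the sample canonical variates from the sample correlation matrix,

(11)  $R = \left(\frac{s_{ij}}{\sqrt{s_{ii}s_{jj}}}\right) = (r_{ij}) = \begin{pmatrix} R_{11} & R_{12} \\ R_{21} & R_{22} \end{pmatrix}.$

Let

(12)  $S_1 = \operatorname{diag}(\sqrt{s_{11}}, \ldots, \sqrt{s_{p_1p_1}}),$

(13)  $S_2 = \operatorname{diag}(\sqrt{s_{p_1+1,p_1+1}}, \ldots, \sqrt{s_{pp}}).$

Then we can write (8) through (10) as

(14)  $R_{12}(S_2c^{(j)}) = l_jR_{11}(S_1a^{(j)}),$

(15)  $R_{21}(S_1a^{(j)}) = l_jR_{22}(S_2c^{(j)}),$

(16)  $(S_1a^{(j)})'R_{11}(S_1a^{(j)}) = 1, \quad (S_2c^{(j)})'R_{22}(S_2c^{(j)}) = 1.$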


We can give these developments a geometric interpretation. The rows of the matrix $(x_1, \ldots, x_N)$ can be interpreted as $p$ vectors in an $N$-dimensional space, and the rows of $(x_1 - \bar x, \ldots, x_N - \bar x)$ are the $p$ vectors projected on the $(N - 1)$-dimensional subspace orthogonal to the equiangular line. Denote these as $x_1^*, \ldots, x_p^*$. Any vector $u^*$ with components $\alpha'(x_1^{(1)} - \bar x^{(1)}, \ldots, x_N^{(1)} - \bar x^{(1)}) = \alpha_1x_1^* + \cdots + \alpha_{p_1}x_{p_1}^*$ is in the $p_1$-space spanned by $x_1^*, \ldots, x_{p_1}^*$, and a vector $v^*$ with components $\gamma'(x_1^{(2)} - \bar x^{(2)}, \ldots, x_N^{(2)} - \bar x^{(2)}) = \gamma_1x_{p_1+1}^* + \cdots + \gamma_{p_2}x_p^*$ is in the $p_2$-space spanned by $x_{p_1+1}^*, \ldots, x_p^*$. The cosine of the angle between these two vectors is the correlation between $u_\alpha = \alpha'x_\alpha^{(1)}$ and $v_\alpha = \gamma'x_\alpha^{(2)}$, $\alpha = 1, \ldots, N$. Finding $\alpha$ and $\gamma$ to maximize the correlation is equivalent to finding the vectors in the $p_1$-space and the $p_2$-space such that the angle between them is least (i.e., has the greatest cosine). This gives the first canonical variates, and the first canonical correlation is the cosine of the angle. Similarly, the second canonical variates correspond to vectors orthogonal to the first canonical variates and with the angle minimized.


12.3.2. Computation

We shall discuss briefly computation in terms of the population quantities. Equations (50), (51), or (52) of Section 12.2 can be used. The computation of $\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$ can be accomplished by solving $\Sigma_{21} = \Sigma_{22}F$ for $F = \Sigma_{22}^{-1}\Sigma_{21}$ and then multiplying by $\Sigma_{12}$. If $p_1$ is sufficiently small, the determinant $|\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} - \nu\Sigma_{11}|$ can be expanded into a polynomial in $\nu$, and the polynomial equation may be solved for $\nu$. The solutions are then inserted into (51) to arrive at the vectors $\alpha$.

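In many cases $p_1$ is too large for this procedure to be efficient. Then one can use an iterative procedure

(17)  $\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\alpha(i) = \lambda^2(i+1)\alpha(i+1),$

starting with an initial approximation $\alpha(0)$; the vector $\alpha(i+1)$ may be normalized by

(18)  $\alpha(i+1)'\Sigma_{11}\alpha(i+1) = 1.$

The $\lambda^2(i+1)$ converges to $\lambda_1^2$ and $\alpha(i+1)$ converges to $\alpha^{(1)}$ (if $\lambda_1 > \lambda_2$). This can be demonstrated in a fashion similar to that used for principal components, using

(19)  $\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} = A\Lambda^2A^{-1}$

from (45) of Section 12.2. See Problem 12.9.

The right-hand side of (19) is $\sum_{i=1}^{p_1}\alpha^{(i)}\lambda_i^2\tilde\alpha^{(i)\prime}$, where $\tilde\alpha^{(i)\prime}$ is the ith row of $A^{-1}$. From the fact that $A'\Sigma_{11}A = I$, we find that $A'\Sigma_{11} = A^{-1}$ and thus $\alpha^{(i)\prime}\Sigma_{11} = \tilde\alpha^{(i)\prime}$. Now

(20)  $\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} - \lambda_1^2\alpha^{(1)}\tilde\alpha^{(1)\prime} = \sum_{i=2}^{p_1}\alpha^{(i)}\lambda_i^2\tilde\alpha^{(i)\prime} = A\operatorname{diag}(0, \lambda_2^2, \ldots, \lambda_{p_1}^2)A^{-1}.$

The maximum characteristic root of this matrix is $\lambda_2^2$. If we now use this matrix for iteration, we will obtain $\lambda_2^2$ and $\alpha^{(2)}$. The procedure is continued to find as many $\lambda_i^2$ and $\alpha^{(i)}$ as desired.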
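
The iteration (17)-(18) with the deflation step (20) can be sketched as below. This is an illustrative Python/NumPy version under stated assumptions, not code from the text: the function name `iterate_roots`, the fixed iteration count, and the all-ones starting vector are choices made for the example, and the input blocks are taken from the correlation matrix of Section 12.5.

```python
import numpy as np


def iterate_roots(s11, s12, s22, n_roots, n_iter=200):
    """Iteration (17)-(18) followed by the deflation (20).

    Returns the squared canonical correlations and the vectors alpha^(i).
    """
    m = np.linalg.solve(s11, s12 @ np.linalg.solve(s22, s12.T))
    roots, alphas = [], []
    for _root in range(n_roots):
        alpha = np.ones(m.shape[0])              # initial approximation alpha(0)
        alpha /= np.sqrt(alpha @ s11 @ alpha)
        lam2 = 0.0
        for _step in range(n_iter):
            w = m @ alpha                         # left-hand side of (17)
            lam2 = np.sqrt(w @ s11 @ w)           # scale that enforces (18)
            alpha = w / lam2
        roots.append(lam2)
        alphas.append(alpha)
        # Deflation (20): alpha' Sigma11 is the matching row of A^{-1}.
        m = m - lam2 * np.outer(alpha, alpha @ s11)
    return np.array(roots), np.column_stack(alphas)


# Blocks of the correlation matrix in Section 12.5.
R11 = np.array([[1.0, 0.7346], [0.7346, 1.0]])
R12 = np.array([[0.7108, 0.7040], [0.6932, 0.7086]])
R22 = np.array([[1.0, 0.8392], [0.8392, 1.0]])
roots, alphas = iterate_roots(R11, R12, R22, n_roots=2)
print(roots)
```

The sign of each returned vector depends on the starting vector, so it may differ from the sign convention used elsewhere in the chapter.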

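Given $\lambda_i$ and $\alpha^{(i)}$, we find $\gamma^{(i)}$ from $\Sigma_{21}\alpha^{(i)} = \lambda_i\Sigma_{22}\gamma^{(i)}$ or equivalently $(1/\lambda_i)\Sigma_{22}^{-1}\Sigma_{21}\alpha^{(i)} = \gamma^{(i)}$. A check on the computations is provided by comparing $\Sigma_{12}\gamma^{(i)}$ and $\lambda_i\Sigma_{11}\alpha^{(i)}$.

For the sample we perform these calculations with $\hat\Sigma_{ij}$ or $S_{ij}$ substituted for $\Sigma_{ij}$. It is often convenient to use $R_{ij}$ in the computation (because $-1 < r_{ij} < 1$) to obtain $S_1a^{(j)}$ and $S_2c^{(j)}$; from these $a^{(j)}$ and $c^{(j)}$ can be computed.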

Modern computational procedures are available for canonical correlations and variates similar to those sketched for principal components. Let

(21)  $Z_1 = (x_1^{(1)} - \bar x^{(1)}, \ldots, x_N^{(1)} - \bar x^{(1)}),$

(22)  $Z_2 = (x_1^{(2)} - \bar x^{(2)}, \ldots, x_N^{(2)} - \bar x^{(2)}).$

The QR decomposition of the transpose of these matrices (Section 11.4) is $Z_i' = Q_iR_i$, where $Q_i'Q_i = I_{p_i}$ and $R_i$ is upper triangular. Then $S_{ij} = Z_iZ_j' = R_i'Q_i'Q_jR_j$, $i, j = 1, 2$, and $S_{ii} = R_i'R_i$, $i = 1, 2$. The canonical correlations are the singular values of $Q_1'Q_2$ and the square roots of the characteristic roots of $(Q_1'Q_2)(Q_1'Q_2)'$ (by Theorem 12.2.2). Then the singular value decomposition of $Q_1'Q_2$ is $P(L \;\; 0)T$, where $P$ and $T$ are orthogonal and $L$ is diagonal. To effect the decomposition Householder transformations are applied to the left and right of $Q_1'Q_2$ to obtain an upper bidiagonal matrix, that is, a matrix with entries on the main diagonal and first superdiagonal. Givens matrices are used to reduce this matrix to a matrix that is diagonal to the degree of approximation required. For more detail see Kennedy and Gentle (1980), Sections 7.2 and 12.2, Chambers (1977), Björck and Golub (1973), Golub and Luk (1976), and Golub and Van Loan (1989).
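
The QR-plus-singular-value route can be tried with standard library routines rather than hand-coded Householder and Givens steps. The sketch below is an assumption-laden illustration in Python with NumPy: the function names `cca_by_qr` and `cca_by_roots` and the simulated data are invented for the example, and `numpy.linalg.qr` and `numpy.linalg.svd` stand in for the transformations described above. The second function solves the determinantal equation directly so the two results can be compared.

```python
import numpy as np


def cca_by_qr(x1, x2):
    """Sample canonical correlations from N x p1 and N x p2 data arrays."""
    z1 = x1 - x1.mean(axis=0)          # rows are the columns of Z_1 in (21)
    z2 = x2 - x2.mean(axis=0)          # rows are the columns of Z_2 in (22)
    q1, _ = np.linalg.qr(z1)           # Z_1' = Q_1 R_1
    q2, _ = np.linalg.qr(z2)           # Z_2' = Q_2 R_2
    return np.linalg.svd(q1.T @ q2, compute_uv=False)


def cca_by_roots(x1, x2):
    """The same quantities from the determinantal equation, for comparison."""
    p1 = x1.shape[1]
    s = np.cov(np.hstack([x1, x2]), rowvar=False)
    s11, s12, s22 = s[:p1, :p1], s[:p1, p1:], s[p1:, p1:]
    nu = np.linalg.eigvals(np.linalg.solve(s11, s12 @ np.linalg.solve(s22, s12.T)))
    return np.sqrt(np.sort(nu.real)[::-1])


rng = np.random.default_rng(0)          # simulated data, purely illustrative
x2 = rng.standard_normal((200, 3))
x1 = x2[:, :2] @ np.array([[1.0, 0.5], [0.0, 1.0]]) + rng.standard_normal((200, 2))
r_qr = cca_by_qr(x1, x2)
r_roots = cca_by_roots(x1, x2)
print(r_qr, r_roots)
```

The QR route never forms $S_{22}^{-1}$, which is the usual reason for preferring it when the variables within a set are nearly collinear.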


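12.4. STATISTICAL INFERENCE

12.4.1. Tests of Independence and of Rank

In Chapter 9 we considered testing the null hypothesis that $X^{(1)}$ and $X^{(2)}$ are independent, which is equivalent to the null hypothesis that $\Sigma_{12} = 0$. Since $A'\Sigma_{12}\Gamma = (\Lambda \;\; 0)$, it is seen that the hypothesis is equivalent to $\Lambda = 0$, that is, $\rho_1 = \cdots = \rho_{p_1} = 0$. The likelihood ratio criterion for testing this null hypothesis is the $N/2$ power of

(1)  $\frac{\begin{vmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{vmatrix}}{|A_{11}|\cdot|A_{22}|} = \frac{\left|\begin{pmatrix} \hat A' & 0 \\ 0 & \hat\Gamma' \end{pmatrix}\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}\begin{pmatrix} \hat A & 0 \\ 0 & \hat\Gamma \end{pmatrix}\right|}{|\hat A'A_{11}\hat A|\cdot|\hat\Gamma'A_{22}\hat\Gamma|} = \frac{\begin{vmatrix} I & \hat\Lambda & 0 \\ \hat\Lambda & I & 0 \\ 0 & 0 & I \end{vmatrix}}{|I|\cdot|I|} = |I - \hat\Lambda^2| = \prod_{i=1}^{p_1}(1 - r_i^2),$

where $r_1 = l_1 \ge \cdots \ge r_{p_1} = l_{p_1} \ge 0$ are the $p_1$ possibly nonzero sample canonical correlations. Under the null hypothesis, the limiting distribution of Bartlett's modification of $-2$ times the logarithm of the likelihood ratio criterion, namely,

(2)  $-[N - \tfrac12(p + 3)]\sum_{i=1}^{p_1}\log(1 - r_i^2),$

is $\chi^2$ with $p_1p_2$ degrees of freedom. (See Section 9.4.) Note that it is approximately

(3)  $N\sum_{i=1}^{p_1}r_i^2 = N\operatorname{tr}A_{11}^{-1}A_{12}A_{22}^{-1}A_{21},$

which is $N$ times Nagao's criterion [(2) of Section 9.5].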

If $\Sigma_{12} \ne 0$, an interesting question is how many population canonical correlations are different from 0; that is, how many canonical variates are needed to explain the correlations between $X^{(1)}$ and $X^{(2)}$? The number of nonzero canonical correlations is equal to the rank of $\Sigma_{12}$. The likelihood ratio criterion for testing the null hypothesis $H_k$: $\rho_{k+1} = \cdots = \rho_{p_1} = 0$, that is, that the rank of $\Sigma_{12}$ is not greater than $k$, is $\prod_{i=k+1}^{p_1}(1 - r_i^2)^{N/2}$ [Fujikoshi (1974)]. Under the null hypothesis

(4)  $-[N - \tfrac12(p + 3)]\sum_{i=k+1}^{p_1}\log(1 - r_i^2)$

has approximately the $\chi^2$-distribution with $(p_1 - k)(p_2 - k)$ degrees of freedom. [Glynn and Muirhead (1978) suggest multiplying the sum in (4) by $N - k - \tfrac12(p + 3) + \sum_{i=1}^{k}(1/r_i^2)$; see also Lawley (1959).]

To determine the numbers of nonzero and zero population canonical correlations one can test that all the roots are 0; if that hypothesis is rejected, test that the $p_1 - 1$ smallest roots are 0; etc. Of course, these procedures are not statistically independent, even asymptotically. Alternatively, one could use a sequence of tests in the opposite direction: Test $\rho_{p_1} = 0$, then $\rho_{p_1-1} = \rho_{p_1} = 0$, and so on, until a hypothesis is rejected or until $\Sigma_{12} = 0$ is accepted. Yet another procedure (which can only be carried out for small $p_1$) is to test $\rho_{p_1} = 0$, then $\rho_{p_1-1} = 0$, and so forth. In this procedure one would use $r_j$ to test the hypothesis $\rho_j = 0$. The relevant asymptotic distribution will be discussed in Section 12.4.2.
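
The statistic (4) is simple to evaluate for each $k$ in the first sequential procedure. The following is a minimal sketch in Python, assuming the sample canonical correlations 0.789 and 0.054 and $N = 25$ reported in Section 12.5; the function name `rank_test_statistic` is hypothetical, and the comparison with a $\chi^2$ critical value is left to a table or a statistics library.

```python
import math


def rank_test_statistic(r, n_obs, p1, p2, k):
    """Statistic (4) and its degrees of freedom for H_k: rho_{k+1} = ... = rho_{p1} = 0."""
    p = p1 + p2
    factor = n_obs - 0.5 * (p + 3)               # Bartlett's multiplier N - (p + 3)/2
    stat = -factor * sum(math.log(1.0 - ri ** 2) for ri in r[k:])
    return stat, (p1 - k) * (p2 - k)


r = [0.789, 0.054]                               # sample canonical correlations, N = 25
results = [rank_test_statistic(r, n_obs=25, p1=2, p2=2, k=k) for k in range(len(r))]
for k, (stat, df) in enumerate(results):
    print(k, round(stat, 3), df)
```

For these inputs the statistic for $k = 0$ is roughly 21 on 4 degrees of freedom, far above the 5% point 9.49, while the statistic for $k = 1$ is well below the 5% point 3.84 on 1 degree of freedom, which agrees with treating only the first canonical correlation as nonzero.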


12.4.2. Distributions of Canonical Correlations

The density of the canonical correlations is given in Section 13.4 for the case that $\Sigma_{12} = 0$, that is, all the population correlations are 0. The density when some population correlations are different from 0 has been given by Constantine (1963) in terms of a hypergeometric function of two matrix arguments.

The large-sample theory is more manageable. Suppose the first $k$ canonical correlations are positive, less than 1, and different, and suppose that $p_1 - k$ correlations are 0. Let

(5)  $z_i = \sqrt{N}\,\frac{r_i^2 - \rho_i^2}{2\rho_i(1 - \rho_i^2)}, \quad i = 1, \ldots, k; \qquad z_i = Nr_i^2, \quad i = k + 1, \ldots, p_1.$

Then in the limiting distribution $z_1, \ldots, z_k$ and the set $z_{k+1}, \ldots, z_{p_1}$ are mutually independent, $z_i$ has the limiting distribution $N(0, 1)$, $i = 1, \ldots, k$, and the density of the limiting distribution of $z_{k+1}, \ldots, z_{p_1}$ is

(6)  $\frac{\pi^{\frac12(p_1-k)^2}\exp\left(-\tfrac12\sum_{i=k+1}^{p_1}z_i\right)}{2^{\frac12(p_1-k)(p_2-k)}\,\Gamma_{p_1-k}[\tfrac12(p_1 - k)]\,\Gamma_{p_1-k}[\tfrac12(p_2 - k)]}\prod_{i=k+1}^{p_1}z_i^{\frac12(p_2-p_1-1)}\prod_{\substack{i,j=k+1 \\ i<j}}^{p_1}(z_i - z_j).$

This is the density (11) of Section 13.3 of the characteristic roots of a $(p_1 - k)$-order matrix with distribution $W(I, p_2 - k)$. Note that the normalizing factor for the squared correlations corresponding to nonzero population correlations is $\sqrt{N}$, while the factor corresponding to zero population correlation is $N$. See Chapter 13.

In large samples we treat $r_i^2$ as $N[\rho_i^2, (1/N)4\rho_i^2(1 - \rho_i^2)^2]$ or $r_i$ as $N[\rho_i, (1/N)(1 - \rho_i^2)^2]$ (by Theorem 4.2.3) to obtain tests of $\rho_i$ or confidence intervals for $\rho_i$. Lawley (1959) has shown that the transformation $z_i = \tanh^{-1}(r_i)$ [see Section 4.2.3] does not stabilize the variance and has a significant bias in estimating $\zeta_i = \tanh^{-1}(\rho_i)$.
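
The normal approximation for $r_i$ gives a quick large-sample interval for a nonzero, simple canonical correlation. The sketch below is only an illustration under assumptions that go beyond the text: the unknown $\rho_i$ in the variance is replaced by $r_i$ (a plug-in step), the function name `rho_interval` is hypothetical, and the values $r = 0.789$, $N = 25$ are borrowed from Section 12.5, where $N$ is too small for the approximation to be taken at face value.

```python
import math


def rho_interval(r, n_obs, z=1.96):
    """Approximate interval for rho_i from r_i ~ N[rho_i, (1 - rho_i^2)^2 / N].

    The unknown rho_i in the variance is replaced by r_i (a plug-in assumption).
    """
    half_width = z * (1.0 - r ** 2) / math.sqrt(n_obs)
    return r - half_width, r + half_width


low, high = rho_interval(0.789, 25)
print(round(low, 3), round(high, 3))
```

Because the limiting distribution changes when $\rho_i = 0$, this interval should not be used to decide whether a canonical correlation is zero; the tests of Section 12.4.1 address that question.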


12.5. AN EXAMPLE 

In this section we consider a simple illustrative example. Rao [(1952), p. 245] gives some measurements on the first and second adult sons in a sample of 25 families. (These have been used in Problem 3.1 and Problem 4.41.) Let $x_{1\alpha}$ be the head length of the first son in the $\alpha$th family, $x_{2\alpha}$ be the head breadth of the first son, $x_{3\alpha}$ be the head length of the second son, and $x_{4\alpha}$ be the head breadth of the second son. We shall investigate the relations between the measurements for the first son and for the second. Thus $x_\alpha^{(1)\prime} = (x_{1\alpha}, x_{2\alpha})$ and $x_\alpha^{(2)\prime} = (x_{3\alpha}, x_{4\alpha})$. The data can be summarized as†

(1)  $\bar x' = (185.72, 151.12, 183.84, 149.24),$

$S = \begin{pmatrix} 95.2933 & 52.8683 & 69.6617 & 46.1117 \\ 52.8683 & 54.3600 & 51.3117 & 35.0533 \\ 69.6617 & 51.3117 & 100.8067 & 56.5400 \\ 46.1117 & 35.0533 & 56.5400 & 45.0233 \end{pmatrix} = \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix}.$



The matrix of correlations is

(2)  $R = \begin{pmatrix} 1.0000 & 0.7346 & 0.7108 & 0.7040 \\ 0.7346 & 1.0000 & 0.6932 & 0.7086 \\ 0.7108 & 0.6932 & 1.0000 & 0.8392 \\ 0.7040 & 0.7086 & 0.8392 & 1.0000 \end{pmatrix} = \begin{pmatrix} R_{11} & R_{12} \\ R_{21} & R_{22} \end{pmatrix}.$

All of the correlations are about 0.7 except for the correlation between the two measurements on second sons. In particular, $R_{12}$ is nearly of rank one, and hence the second canonical correlation will be near zero.

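We compute

(3)  $R_{22}^{-1}R_{21} = \begin{pmatrix} 0.405769 & 0.333205 \\ 0.363480 & 0.428976 \end{pmatrix},$

(4)  $R_{12}R_{22}^{-1}R_{21} = \begin{pmatrix} 0.544311 & 0.538841 \\ 0.538841 & 0.534950 \end{pmatrix}.$

The determinantal equation is

(5)  $0 = |R_{12}R_{22}^{-1}R_{21} - \nu R_{11}| = \begin{vmatrix} 0.544311 - 1.0000\nu & 0.538841 - 0.7346\nu \\ 0.538841 - 0.7346\nu & 0.534950 - 1.0000\nu \end{vmatrix} = 0.460363\nu^2 - 0.287596\nu + 0.000830.$

The roots are 0.621816 and 0.002900; thus $l_1 = 0.788553$ and $l_2 = 0.053852$. Corresponding to these roots are the vectors

(6)  $S_1a^{(1)} = \begin{pmatrix} 0.552166 \\ 0.521548 \end{pmatrix}, \quad S_1a^{(2)} = \begin{pmatrix} 1.366501 \\ -1.378467 \end{pmatrix},$

where

(7)  $S_1 = \begin{pmatrix} \sqrt{s_{11}} & 0 \\ 0 & \sqrt{s_{22}} \end{pmatrix} = \begin{pmatrix} 9.7618 & 0 \\ 0 & 7.3729 \end{pmatrix}.$

† Rao's computations are in error: his last "difference" is incorrect.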
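We apply $(1/l_i)R_{22}^{-1}R_{21}$ to $S_1a^{(i)}$ to obtain

(8)  $S_2c^{(1)} = \begin{pmatrix} 0.5045 \\ 0.5382 \end{pmatrix}, \quad S_2c^{(2)} = \begin{pmatrix} 1.767281 \\ -1.757288 \end{pmatrix},$

where

(9)  $S_2 = \begin{pmatrix} \sqrt{s_{33}} & 0 \\ 0 & \sqrt{s_{44}} \end{pmatrix} = \begin{pmatrix} 10.0402 & 0 \\ 0 & 6.7099 \end{pmatrix}.$

We check these computations by calculating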

⑽ Wmv ⑴) {:盖副， ^( w ( —;: 3 3= 


The first vector in (10) corresponds closely to the first vector in (6); in fact, it is a slight improvement, for the computation is equivalent to an iteration on $S_1a^{(1)}$. The second vector in (10) does not correspond as closely to the second vector in (6). One reason is that $l_2$ is correct to only four or five significant figures (as is $\nu_2 = l_2^2$) and thus the components of $S_2c^{(2)}$ can be correct to only as many significant figures; secondly, the fact that $S_2c^{(2)}$ corresponds to the smaller root means that the iteration decreases the accuracy instead of increasing it. Our final results are


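(11)  $l_1 = 0.789, \quad l_2 = 0.054,$

$a^{(1)} = \begin{pmatrix} 0.0566 \\ 0.0707 \end{pmatrix}, \quad a^{(2)} = \begin{pmatrix} 0.1400 \\ -0.1870 \end{pmatrix},$

$c^{(1)} = \begin{pmatrix} 0.0502 \\ 0.0802 \end{pmatrix}, \quad c^{(2)} = \begin{pmatrix} 0.1760 \\ -0.2619 \end{pmatrix}.$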


The larger of the two canonical correlations, 0.789, is larger than any of the individual correlations of a variable of the first set with a variable of the other. The second canonical correlation is very near zero. This means that to study the relation between two head dimensions of first sons and second sons we can confine our attention to the first canonical variates; the second canonical variates are correlated only slightly. The first canonical variate in each set is approximately proportional to the sum of the two measurements divided by their respective standard deviations; the second canonical variate in each set is approximately proportional to the difference of the two standardized measurements.
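
The arithmetic of this example can be retraced from the correlation matrix with a few lines of code. The following Python/NumPy sketch is an illustration rather than the procedure used in the text: it calls a general characteristic-root routine instead of expanding the quadratic, and the variable names are chosen for the example. The inputs are the matrix $R$ of (2) and the diagonal of $S$.

```python
import numpy as np

R = np.array([
    [1.0000, 0.7346, 0.7108, 0.7040],
    [0.7346, 1.0000, 0.6932, 0.7086],
    [0.7108, 0.6932, 1.0000, 0.8392],
    [0.7040, 0.7086, 0.8392, 1.0000],
])
sd = np.sqrt([95.2933, 54.3600, 100.8067, 45.0233])    # square roots of the diagonal of S

R11, R12, R21, R22 = R[:2, :2], R[:2, 2:], R[2:, :2], R[2:, 2:]
B = np.linalg.solve(R22, R21)                 # R22^{-1} R21, as in (3)
M = R12 @ B                                   # R12 R22^{-1} R21, as in (4)

nu, V = np.linalg.eig(np.linalg.solve(R11, M))
order = np.argsort(nu.real)[::-1]
nu, V = nu.real[order], V.real[:, order]
l = np.sqrt(nu)                               # sample canonical correlations

# Scale columns so (S1 a)' R11 (S1 a) = 1 and make the first component positive.
S1a = V / np.sqrt(np.sum(V * (R11 @ V), axis=0))
S1a = S1a * np.sign(S1a[0])
S2c = (B @ S1a) / l                           # (1/l_i) R22^{-1} R21 (S1 a^(i))

a = S1a / sd[:2, None]                        # undo the scaling by standard deviations
c = S2c / sd[2:, None]
print(np.round(l, 3))
print(np.round(a, 4))
print(np.round(c, 4))
```

Small differences from (11) in the last printed digit of the second pair of vectors are to be expected, since the text notes that those components are accurate to only four or five significant figures.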
12.6. LINEARLY RELATED EXPECTED VALUES

12.6.1. Canonical Analysis of Regression Matrices

In this section we develop canonical correlations and variates for one stochastic vector and one nonstochastic vector. The expected value of the stochastic vector is a linear function of the nonstochastic vector (Chapter 8). We find new coordinate systems so that the expected value of each coordinate of the stochastic vector depends on only one coordinate of the nonstochastic vector; the coordinates of the stochastic vector are uncorrelated in the stochastic sense, and the coordinates of the nonstochastic vector are uncorrelated in the sample. The coordinates are ordered according to the effect sum of squares relative to the variance. The algebra is similar to that developed in Section 12.2.

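If $X$ has the normal distribution $N(\mu, \Sigma)$ with $X$, $\mu$, and $\Sigma$ partitioned as in (1) and (2) of Section 12.2 and $\mu = (\mu^{(1)\prime}, \mu^{(2)\prime})'$, the conditional distribution of $X^{(1)}$ given $x^{(2)}$ is normal with mean

(1)  $\mu^{(1)} + \beta(x^{(2)} - \mu^{(2)}),$

and covariance matrix

(2)  $\Sigma_{11\cdot2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$

Since we consider a set of random vectors $X_1^{(1)}, \ldots, X_N^{(1)}$ with expected values depending on $x_1^{(2)}, \ldots, x_N^{(2)}$ (nonstochastic), we can write the conditional expected value of $X_\phi^{(1)}$ as $\tau + \beta(x_\phi^{(2)} - \bar x^{(2)})$, where $\tau = \mu^{(1)} + \beta(\bar x^{(2)} - \mu^{(2)})$ can be considered as a parameter vector. This is the model of Section 8.2 with a slight change of notation.

The model of this section is

(3)  $\mathcal{E}X_\phi^{(1)} = \tau + \beta(x_\phi^{(2)} - \bar x^{(2)}), \quad \phi = 1, \ldots, N,$

where $x_1^{(2)}, \ldots, x_N^{(2)}$ are a set of nonstochastic vectors $(q \times 1)$ and $\bar x^{(2)} = N^{-1}\sum_{\phi=1}^{N}x_\phi^{(2)}$. The covariance matrix is

(4)  $\mathcal{E}(X_\phi^{(1)} - \mathcal{E}X_\phi^{(1)})(X_\phi^{(1)} - \mathcal{E}X_\phi^{(1)})' = \Psi.$

Consider a linear combination of $X_\phi^{(1)}$, say $U_\phi = \alpha'X_\phi^{(1)}$. Then $U_\phi$ has variance $\alpha'\Psi\alpha$ and expected value

(5)  $\mathcal{E}U_\phi = \alpha'\tau + \alpha'\beta(x_\phi^{(2)} - \bar x^{(2)}).$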
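The mean expected value is $(1/N)\sum_{\phi=1}^{N}\mathcal{E}U_\phi = \alpha'\tau$, and the mean sum of squares due to $x^{(2)}$ is

(6)  $\frac{1}{N}\sum_{\phi=1}^{N}(\mathcal{E}U_\phi - \alpha'\tau)^2 = \frac{1}{N}\sum_{\phi=1}^{N}\alpha'\beta(x_\phi^{(2)} - \bar x^{(2)})(x_\phi^{(2)} - \bar x^{(2)})'\beta'\alpha = \alpha'\beta S_{22}\beta'\alpha.$

We can ask for the linear combination that maximizes the mean sum of squares relative to its variance; that is, the linear combination of dependent variables on which the independent variables have greatest effect. We want to maximize (6) subject to $\alpha'\Psi\alpha = 1$. That leads to the vector equation

(7)  $(\beta S_{22}\beta' - \kappa\Psi)\alpha = 0$

for $\kappa$ satisfying

(8)  $|\beta S_{22}\beta' - \kappa\Psi| = 0.$

Multiplication of (7) on the left by $\alpha'$ shows that $\alpha'\beta S_{22}\beta'\alpha = \kappa$ for $\alpha$ and $\kappa$ satisfying $\alpha'\Psi\alpha = 1$ and (7); to obtain the maximum we take the largest root of (8), say $\kappa_1$. Denote this vector by $\alpha^{(1)}$, and the corresponding random variable by $U_{1\phi} = \alpha^{(1)\prime}X_\phi^{(1)}$. The expected value of this first canonical variable is $\mathcal{E}U_{1\phi} = \alpha^{(1)\prime}[\beta(x_\phi^{(2)} - \bar x^{(2)}) + \tau]$. Let $\alpha^{(1)\prime}\beta = k\gamma^{(1)\prime}$, where $k$ is determined so

(9)  $1 = \frac{1}{N}\sum_{\phi=1}^{N}\left[\gamma^{(1)\prime}x_\phi^{(2)} - \frac{1}{N}\sum_{\eta=1}^{N}\gamma^{(1)\prime}x_\eta^{(2)}\right]^2 = \gamma^{(1)\prime}S_{22}\gamma^{(1)}.$

Then $k = \sqrt{\kappa_1}$. Let $v_{1\phi} = \gamma^{(1)\prime}(x_\phi^{(2)} - \bar x^{(2)})$. Then $\mathcal{E}U_{1\phi} = \sqrt{\kappa_1}\,v_{1\phi} + \alpha^{(1)\prime}\tau$.

Next let us obtain a linear combination $U_\phi = \alpha'X_\phi^{(1)}$ that has maximum effect sum of squares among all linear combinations with variance 1 and uncorrelated with $U_{1\phi}$, that is, $0 = \mathcal{E}(U_\phi - \mathcal{E}U_\phi)(U_{1\phi} - \mathcal{E}U_{1\phi})' = \alpha'\Psi\alpha^{(1)}$.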

As in Section 12.2, we can set up this maximization problem with Lagrange multipliers and find that $\alpha$ satisfies (7) for some $\kappa$ satisfying (8) and $\alpha'\Psi\alpha = 1$. The process is continued in a manner similar to that in Section 12.2. We summarize the results.

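The jth canonical random variable is $U_{j\phi} = \alpha^{(j)\prime}X_\phi^{(1)}$, where $\alpha^{(j)}$ satisfies (7) for $\kappa = \kappa_j$ and $\alpha^{(j)\prime}\Psi\alpha^{(j)} = 1$; $\kappa_1 \ge \kappa_2 \ge \cdots \ge \kappa_{p_1}$ are the roots of (8). We shall assume the rank of $\beta$ is $p_1 \le p_2$. (Then $\kappa_{p_1} > 0$.) $U_{j\phi}$ has the largest effect sum of squares of linear combinations that have unit variance and are uncorrelated with $U_{1\phi}, \ldots, U_{j-1,\phi}$. Let $\gamma^{(j)} = (1/\sqrt{\kappa_j})\beta'\alpha^{(j)}$, $\nu_j = \alpha^{(j)\prime}\tau$, and $v_{j\phi} = \gamma^{(j)\prime}(x_\phi^{(2)} - \bar x^{(2)})$. Then

(10)  $\mathcal{E}U_{j\phi} = \sqrt{\kappa_j}\,v_{j\phi} + \nu_j,$

(11)  $\frac{1}{N}\sum_{\phi=1}^{N}v_{j\phi}^2 = 1,$

(12)  $\frac{1}{N}\sum_{\phi=1}^{N}v_{i\phi}v_{j\phi} = 0, \quad i \ne j.$

If $p_2 > p_1$, then $\gamma^{(p_1+1)}, \ldots, \gamma^{(p_2)}$ can be chosen so $v_{p_1+1,\phi} = \gamma^{(p_1+1)\prime}(x_\phi^{(2)} - \bar x^{(2)}), \ldots, v_{p_2\phi} = \gamma^{(p_2)\prime}(x_\phi^{(2)} - \bar x^{(2)})$ satisfy (11) and (12).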

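Let $A = (\alpha^{(1)}, \ldots, \alpha^{(p_1)})$, $\Gamma_1 = (\gamma^{(1)}, \ldots, \gamma^{(p_1)})$, $\Gamma_2 = (\gamma^{(p_1+1)}, \ldots, \gamma^{(p_2)})$, $\Delta = \operatorname{diag}(\delta_1, \ldots, \delta_{p_1}) = \operatorname{diag}(\sqrt{\kappa_1}, \ldots, \sqrt{\kappa_{p_1}})$, $U_\phi = A'X_\phi^{(1)}$, $v_\phi^{(1)} = \Gamma_1'(x_\phi^{(2)} - \bar x^{(2)})$, $v_\phi^{(2)} = \Gamma_2'(x_\phi^{(2)} - \bar x^{(2)})$, and $\nu = (\nu_1, \ldots, \nu_{p_1})'$, $\phi = 1, \ldots, N$. Then

(13)  $\mathcal{E}(U_\phi - \mathcal{E}U_\phi)(U_\phi - \mathcal{E}U_\phi)' = A'\Psi A = I,$

(14)  $\mathcal{E}U_\phi = \Delta v_\phi^{(1)} + \nu, \quad \phi = 1, \ldots, N,$

(15)  $\frac{1}{N}\sum_{\phi=1}^{N}\begin{pmatrix} v_\phi^{(1)} \\ v_\phi^{(2)} \end{pmatrix}\begin{pmatrix} v_\phi^{(1)\prime} & v_\phi^{(2)\prime} \end{pmatrix} = I.$

The random canonical variates are uncorrelated and have variance 1. The expected value of each random canonical variate is a multiple of the corresponding nonstochastic canonical variate plus a constant. The nonstochastic canonical variates have sample variance 1 and are uncorrelated in the sample.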

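If $p_1 > p_2$, the maximum rank of $\beta$ is $p_2$ and $\kappa_{p_2+1} = \cdots = \kappa_{p_1} = 0$. In that case we define $A_1 = (\alpha^{(1)}, \ldots, \alpha^{(p_2)})$ and $A_2 = (\alpha^{(p_2+1)}, \ldots, \alpha^{(p_1)})$, where $\alpha^{(1)}, \ldots, \alpha^{(p_2)}$ (corresponding to positive $\kappa$'s) are defined as before and $\alpha^{(p_2+1)}, \ldots, \alpha^{(p_1)}$ are any vectors satisfying $\alpha^{(j)\prime}\Psi\alpha^{(j)} = 1$ and $\alpha^{(j)\prime}\Psi\alpha^{(i)} = 0$, $i \ne j$. Then $\mathcal{E}U_{i\phi} = \delta_iv_{i\phi} + \nu_i$, $i = 1, \ldots, p_2$, and $\mathcal{E}U_{i\phi} = \nu_i$, $i = p_2 + 1, \ldots, p_1$.

In either case if the rank of $\beta$ is $r \le \min(p_1, p_2)$, there are $r$ roots of (8) that are nonzero and hence $\mathcal{E}U_{i\phi} = \delta_iv_{i\phi} + \nu_i$ for $i = 1, \ldots, r$.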
12.6.2. Estimation

Let $x_1^{(1)}, \ldots, x_N^{(1)}$ be a set of observations on $X_1^{(1)}, \ldots, X_N^{(1)}$ with the probability structure developed in Section 12.6.1, and let $x_1^{(2)}, \ldots, x_N^{(2)}$ be the set of corresponding independent variates. Then we can estimate $\tau$, $\beta$, and $\Psi$ by

(16)  $\hat\tau = \frac{1}{N}\sum_{\phi=1}^{N}x_\phi^{(1)} = \bar x^{(1)},$

(17)  $\hat\beta = A_{12}A_{22}^{-1} = S_{12}S_{22}^{-1},$

(18)  $\hat\Psi = \frac{1}{n}\sum_{\phi=1}^{N}[x_\phi^{(1)} - \bar x^{(1)} - \hat\beta(x_\phi^{(2)} - \bar x^{(2)})][x_\phi^{(1)} - \bar x^{(1)} - \hat\beta(x_\phi^{(2)} - \bar x^{(2)})]' = \frac{1}{n}(A_{11} - A_{12}A_{22}^{-1}A_{21}) = S_{11} - S_{12}S_{22}^{-1}S_{21},$

where the $A_{ij}$'s and $S_{ij}$'s are defined as before. (It is convenient to divide by $n = N - 1$ instead of by $N$; the latter would yield maximum likelihood estimators.)

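The sample analogs of (7) and (8) are

(19)  $0 = (\hat\beta S_{22}\hat\beta' - k\hat\Psi)a = [S_{12}S_{22}^{-1}S_{21} - k(S_{11} - S_{12}S_{22}^{-1}S_{21})]a,$

(20)  $0 = |\hat\beta S_{22}\hat\beta' - k\hat\Psi| = |S_{12}S_{22}^{-1}S_{21} - k(S_{11} - S_{12}S_{22}^{-1}S_{21})|.$

The roots $k_1 \ge \cdots \ge k_{p_1}$ of (20) estimate the roots $\kappa_1 \ge \cdots \ge \kappa_{p_1}$ of (8), and the corresponding solutions $\tilde a^{(1)}, \ldots, \tilde a^{(p_1)}$ of (19), normalized by $\tilde a^{(i)\prime}\hat\Psi\tilde a^{(i)} = 1$, estimate $\alpha^{(1)}, \ldots, \alpha^{(p_1)}$. Then $\tilde c^{(j)} = (1/\sqrt{k_j})\hat\beta'\tilde a^{(j)}$ estimates $\gamma^{(j)}$, and $n_j = \tilde a^{(j)\prime}\bar x^{(1)}$ estimates $\nu_j$. The sample canonical variates are $\tilde a^{(j)\prime}x_\phi^{(1)}$ and $\tilde c^{(j)\prime}(x_\phi^{(2)} - \bar x^{(2)})$, $j = 1, \ldots, p_1$, $\phi = 1, \ldots, N$. If $p_1 > p_2$, then $p_1 - p_2$ more $\tilde a^{(j)}$'s can be defined satisfying $\tilde a^{(j)\prime}\hat\Psi\tilde a^{(j)} = 1$ and $\tilde a^{(j)\prime}\hat\Psi\tilde a^{(i)} = 0$, $i \ne j$.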

12.6.3. Relations Between Canonical Variates

In Section 12.3, the roots $l_1 \ge \cdots \ge l_{p_1}$ were defined to satisfy

(21)  $0 = \begin{vmatrix} -lS_{11} & S_{12} \\ S_{21} & -lS_{22} \end{vmatrix} = (-1)^{p_2}l^{p_2-p_1}|S_{22}|\cdot|S_{12}S_{22}^{-1}S_{21} - l^2S_{11}|.$

Since (20) can be written

(22)  $0 = |(1 + k)S_{12}S_{22}^{-1}S_{21} - kS_{11}|,$

we see that $l_i^2 = k_i/(1 + k_i)$ and $k_i = l_i^2/(1 - l_i^2)$, $i = 1, \ldots, p_1$. The vector $a^{(i)}$ in Section 12.3 satisfies

(23)  $0 = (S_{12}S_{22}^{-1}S_{21} - l_i^2S_{11})a^{(i)} = \left(S_{12}S_{22}^{-1}S_{21} - \frac{k_i}{1 + k_i}S_{11}\right)a^{(i)} = \frac{1}{1 + k_i}[S_{12}S_{22}^{-1}S_{21} - k_i(S_{11} - S_{12}S_{22}^{-1}S_{21})]a^{(i)},$

which is equivalent to (19) for $k = k_i$. Comparison of the normalizations $a^{(i)\prime}S_{11}a^{(i)} = 1$ and $\tilde a^{(i)\prime}(S_{11} - S_{12}S_{22}^{-1}S_{21})\tilde a^{(i)} = 1$ shows that $\tilde a^{(i)} = (1/\sqrt{1 - l_i^2})a^{(i)}$. Then $\tilde c^{(j)} = (1/\sqrt{k_j})S_{22}^{-1}S_{21}\tilde a^{(j)} = c^{(j)}$.

We see that canonical variable analysis can be applied when the two vectors are jointly random and when one vector is random and the other is nonstochastic. The canonical variables defined by the two approaches are the same except for normalization. The measure of relationship between corresponding canonical variables can be the (canonical) correlation or it can be the ratio of "explained" to "unexplained" variance.
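
The relation $k_i = l_i^2/(1 - l_i^2)$ between the two sets of roots can be checked numerically. The sketch below is illustrative only: it uses simulated data and Python with NumPy, solves (20) and the determinantal equation in $l^2$ separately, and compares the results; none of the names come from the text.

```python
import numpy as np

rng = np.random.default_rng(1)                     # simulated data, purely illustrative
x2 = rng.standard_normal((60, 3))
x1 = x2[:, :2] + rng.standard_normal((60, 2))
S = np.cov(np.hstack([x1, x2]), rowvar=False)
S11, S12, S21, S22 = S[:2, :2], S[:2, 2:], S[2:, :2], S[2:, 2:]

M = S12 @ np.linalg.solve(S22, S21)                # S12 S22^{-1} S21

# Roots l_i^2 of |S12 S22^{-1} S21 - l^2 S11| = 0, largest first.
l2 = np.sort(np.linalg.eigvals(np.linalg.solve(S11, M)).real)[::-1]
# Roots k_i of (20), with Psi-hat = S11 - S12 S22^{-1} S21, largest first.
k = np.sort(np.linalg.eigvals(np.linalg.solve(S11 - M, M)).real)[::-1]

print(np.allclose(k, l2 / (1.0 - l2)))
print(np.allclose(l2, k / (1.0 + k)))
```

Because $l^2 \mapsto l^2/(1 - l^2)$ is increasing on $[0, 1)$, sorting both sets of roots in the same order is enough to pair them.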

12.6.4. Testing Rank

The number of roots $\kappa_j$ that are different from 0 is the rank of the regression matrix $\beta$; it is the number of linear combinations of the regression variables that are needed to express the expected values of $X_\phi^{(1)}$. We can ask whether the rank is $k$ ($1 \le k < p_1$ if $p_1 \le p_2$) against the alternative that the rank is greater than $k$. The hypothesis is

(24)  $H_k$: $\kappa_{k+1} = \cdots = \kappa_{p_1} = 0.$

The likelihood ratio criterion [Anderson (1951b)] is a power of

(25)  $\prod_{i=k+1}^{p_1}(1 + k_i)^{-1} = \prod_{i=k+1}^{p_1}(1 - l_i^2).$

Note that this is the same criterion as for the case of both vectors stochastic (Section 12.4). Then

(26)  $-[N - \tfrac12(p + 3)]\sum_{i=k+1}^{p_1}\log(1 - l_i^2)$

has approximately the $\chi^2$-distribution with $(p_1 - k)(p_2 - k)$ degrees of freedom.

The determination of the rank as any number between 0 and $p_1$ can be done as in Section 12.4.

12.6.5. Linear Functional Relationships

The study of Section 12.6 can be carried out in other terms. For example, the balanced one-way analysis of variance can be set up as

(27)  $Y_{\alpha j} = \nu_\alpha + \mu + U_{\alpha j}, \quad \alpha = 1, \ldots, m, \quad j = 1, \ldots, l,$

where $\mathcal{E}U_{\alpha j} = 0$, $\mathcal{E}U_{\alpha j}U_{\alpha j}' = \Psi$, $\sum_{\alpha=1}^{m}\nu_\alpha = 0$, and

(28)  $\Theta\nu_\alpha = 0, \quad \alpha = 1, \ldots, m,$

where $\Theta$ is $q \times p_1$ of rank $q$ $(\le p_1)$. This is a special case of the model of Section 12.6.1 with $\phi = 1, \ldots, N$, replaced by the pair of indices $(\alpha, j)$, $X_\phi^{(1)} = Y_{\alpha j}$, $\tau = \mu$, and $\beta(x_\phi^{(2)} - \bar x^{(2)}) = \nu_\alpha$ by use of dummy variables as in Section 8.8. The rank of $(\nu_1, \ldots, \nu_m)$ is that of $\beta$, namely, $r = p_1 - q$. There are $q$ roots of (8) equal to 0 with

(29) p5 22 p^/£ vX- 

a* I 

The model (27) can be interpreted as repeated observations on $\nu_\alpha + \mu$ with error. The component equations of (28) are the linear functional relationships.

Let $\bar y_\alpha = (1/l)\sum_{j=1}^{l}y_{\alpha j}$ and $\bar y = (1/m)\sum_{\alpha=1}^{m}\bar y_\alpha$. The sum of squares for effect is

(30) 

a « 1 

with m 一 1 degrees of freedom, and the sum of squares for error is 

(31) G = £ D 九) , 企 

a*1 j =I 

with m (/ — 1) degrees of freedom. The case p! < p 2 corresponds to </ 
Then a maximum likelihood estimator of © is 


(32) 


0 = (5( rH> ，…， 5 ⑻ /， 
and the maximum likelihood estimators of $\nu_\alpha$ are

(33) K-^^{y a -y), «=i ，…， «• 

The estimator (32) can be multiplied by any nonsingular $q \times q$ matrix on the left to obtain another. For a fuller discussion, see Anderson (1984a) and Kendall and Stuart (1973).


12.7. REDUCED RANK REGRESSION


Reduced rank regression involves estimating the regression matrix $\beta$ in $\mathcal{E}(X^{(1)} \mid X^{(2)}) = \beta X^{(2)}$ by a matrix $\hat\beta$ of preassigned rank $k$. In the limited-information maximum likelihood method of estimating an equation that is part of a system of simultaneous equations (Section 12.8), the regression matrix is assumed to be of rank one less than the order of the matrix. Anderson (1951a) derived the maximum likelihood estimator of $\beta$ when the model is

(1)  $X_\alpha^{(1)} = \tau + \beta(x_\alpha^{(2)} - \bar x^{(2)}) + Z_\alpha, \quad \alpha = 1, \ldots, N,$

the rank of $\beta$ is specified to be $k$ $(\le p_1)$, the vectors $x_1^{(2)}, \ldots, x_N^{(2)}$ are nonstochastic, and $Z_\alpha$ is normally distributed. On the basis of a sample
X 】， … ， x N ，define 2 by (2) of Section 123 and A, A, and r by (3) ，（ 4)，and 
(5). Partition A = diag(A A 2 X A = (Aj,A 2 ), and f = p f 2 )，where A” 
A,, and f j have k columns. Let <!>, = - \ 

Definition 12.7.1 (Reduced Rank Regression) The reduced rank regression 
estimator in (1) is 


(2) B k — S V ^rjf{ = SyyAj A jf { — 

Y\ } here /# = 2 [2^22 an( ^ ^zz = $ u — 

The maximum likelihood estimator of $\beta$ of rank $k$ is the same for $X^{(1)}$ and
$X^{(2)}$ normally distributed because the density of $X = (X^{(1)\prime}, X^{(2)\prime})'$ factors as

(3)    $n(x \mid \mu, \Sigma) = n(x^{(1)} \mid \mu^{(1)} + \beta(x^{(2)} - \mu^{(2)}), \Sigma_{ZZ})\, n(x^{(2)} \mid \mu^{(2)}, \Sigma_{22})$.

Reduced rank regression has been applied in many disciplines, including
econometrics, time series analysis, and signal processing. See, for example,
Johansen (1995) for use of reduced rank regression in estimation of cointegration
in economic time series, Tsay and Tiao (1985) and Ahn and Reinsel
(1988) for applications in stationary processes, and Stoica and Viberg (1996)
for utilization in signal processing. In general the estimated reduced rank
regression is a better estimator in a regression model than the unrestricted
estimator.

In Section 13.7 the asymptotic distribution of the reduced rank regression
estimator is obtained under the assumptions that are sufficient for the
asymptotic normality of the least squares estimator $\hat B = \hat\Sigma_{12}\hat\Sigma_{22}^{-1}$. The asymptotic
distribution of $\hat\beta_k$ has been obtained by Ryan, Hubert, Carter, Sprague,
and Parrott (1992), Schmidli (1996), Stoica and Viberg (1996), and Reinsel
and Velu (1998) by use of the expected Fisher information on the assumption
that $Z_\alpha$ is normally distributed. Izenman (1975) suggested the term reduced
rank regression.
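The construction can be illustrated numerically. The following is a minimal sketch, not the text's formula (2): it assumes the common form of the rank-$k$ estimator in which the least squares estimate is projected onto the $k$ leading characteristic vectors of a matrix weighted by the inverse residual covariance. The function name `reduced_rank_regression` and its arguments are hypothetical.

```python
import numpy as np

def reduced_rank_regression(X1, X2, k):
    """Rank-k estimate of the regression of X1 (N x p1) on X2 (N x p2).

    Returns (B_k, B_hat), where B_hat is the unrestricted least squares
    estimate. Assumed form: project B_hat using the residual covariance
    as the metric (a sketch, not a definitive implementation).
    """
    N = X1.shape[0]
    X1c = X1 - X1.mean(axis=0)
    X2c = X2 - X2.mean(axis=0)
    S11 = X1c.T @ X1c / N
    S12 = X1c.T @ X2c / N
    S22 = X2c.T @ X2c / N
    B_hat = S12 @ np.linalg.inv(S22)        # unrestricted estimator
    S_zz = S11 - B_hat @ S22 @ B_hat.T      # residual covariance
    # Symmetric square root of S_zz and its inverse.
    w, Q = np.linalg.eigh(S_zz)
    root = Q @ np.diag(np.sqrt(w)) @ Q.T
    root_inv = Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T
    # Leading k characteristic vectors of the weighted "effect" matrix.
    M = root_inv @ B_hat @ S22 @ B_hat.T @ root_inv
    vals, vecs = np.linalg.eigh(M)
    V = vecs[:, np.argsort(vals)[::-1][:k]]
    B_k = root @ V @ V.T @ root_inv @ B_hat
    return B_k, B_hat
```

With $k = p_1$ the projection is the identity and the sketch returns the unrestricted estimator; for smaller $k$ the result has rank $k$ by construction.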

12.8. SIMULTANEOUS EQUATIONS MODELS
12.8.1. The Model

Inference for structural equation models in econometrics is related to canonical
correlations. The general model is

(1)    $B y_t + \Gamma z_t = u_t$,

where $B$ is $G \times G$ and $\Gamma$ is $G \times K$. Here $y_t$ is composed of $G$ jointly
dependent variables (endogenous), $z_t$ is composed of $K$ predetermined
variables (exogenous and lagged dependent) which are treated as "independent"
variables, and $u_t$ consists of $G$ unobservable random variables with

(2)    $\mathcal{E} u_t = 0, \qquad \mathcal{E} u_t u_t' = \Sigma$.

We require $B$ to be nonsingular. This model was initiated by Haavelmo
(1944) and was developed by Koopmans, Marschak, Hurwicz, Anderson,
Rubin, Leipnik, et al., 1944-1954, at the Cowles Commission for Research in
Economics. Each component equation represents the behavior of some group
(such as consumers or producers) and has economic meaning.

The set of structural equations (1) can be solved for $y_t$ (because $B$ is
nonsingular):

(3)    $y_t = \Pi z_t + v_t$,

where

(4)    $\Pi = -B^{-1}\Gamma, \qquad v_t = B^{-1}u_t$,

with

(5)    $\mathcal{E} v_t = 0, \qquad \mathcal{E} v_t v_t' = B^{-1}\Sigma (B')^{-1} = \Omega$,

say. The equation (3) is called the reduced form of the model. It is a
multivariate regression model. In principle, it is observable.
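As a small numerical illustration of (3) to (5), the sketch below computes the reduced form parameters from structural parameters. The numerical values of `B`, `Gamma`, and `Sigma` are invented for the example, and the function name `reduced_form` is hypothetical.

```python
import numpy as np

def reduced_form(B, Gamma, Sigma):
    """Map structural parameters (B, Gamma, Sigma) to (Pi, Omega)."""
    B_inv = np.linalg.inv(B)            # requires B nonsingular
    Pi = -B_inv @ Gamma                 # reduced form coefficients, as in (4)
    Omega = B_inv @ Sigma @ B_inv.T     # covariance of v_t, as in (5)
    return Pi, Omega

# Illustrative values only: G = 2 equations, K = 3 predetermined variables.
B = np.array([[1.0, -0.5],
              [0.3, 1.0]])
Gamma = np.array([[-1.0, 0.0, 2.0],
                  [0.0, -0.7, 0.5]])
Sigma = np.array([[1.0, 0.2],
                  [0.2, 0.5]])
Pi, Omega = reduced_form(B, Gamma, Sigma)
```

Multiplying the reduced form back by $B$ recovers the structural relation, since $B\Pi + \Gamma = 0$ and $B\Omega B' = \Sigma$.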

12.8.2. Identification by Specified Zeros 

The structural equation (1) can be multiplied on the left by an arbitrary
nonsingular matrix. To determine component equations that are economically
meaningful, restrictions must be imposed. For example, in the case of
demand and supply the equation describing demand may be distinguished by
the fact that it includes consumer income and excludes cost of raw materials,
which is in the supply equation. The exclusion of the latter amounts to
specifying that its coefficient in the demand equation is 0.

We consider identification of a structural equation by specifying certain
coefficients to be 0. It is convenient to treat the first equation. Suppose the
variables are numbered so that the first $G_1$ jointly dependent variables are
included in the first equation and the remaining $G_2 = G - G_1$ are not and
the first $K_1$ predetermined variables are included and $K_2 = K - K_1$ are
excluded. Then we can partition the coefficient matrices as

(6) (b = 上 厶丄， 


where the vectors $\beta$, $0$, $\gamma$, and $0$ have $G_1$, $G_2$, $K_1$, and $K_2$ components,
respectively. The reduced form is partitioned conformally into $G_1$ and $G_2$
sets of rows and $K_1$ and $K_2$ sets of columns:

(7)    $\Pi = \begin{pmatrix} \Pi_{11} & \Pi_{12} \\ \Pi_{21} & \Pi_{22} \end{pmatrix}$.

The relation between $B$, $\Gamma$, and $\Pi$ can be expressed as

(8)    $(\gamma' \;\; 0') = -(\beta' \;\; 0')\,\Pi = -(\beta'\Pi_{11} \;\;\; \beta'\Pi_{12})$.

The upper right-hand corner of (8) yields

(9)    $\beta'\Pi_{12} = 0$.


To determine $\beta$ ($G_1 \times 1$) uniquely except for a constant of proportionality we
need

(10)    $\operatorname{rank}(\Pi_{12}) = G_1 - 1$.

This implies

(11)    $K_2 \ge G_1 - 1$.

Addition of $G_2$ to (11) gives the order condition

(12)    $G_2 + K_2 \ge G_1 + G_2 - 1 = G - 1$.

The number of specified 0's in an identified equation must be at least equal
to 1 less than the number of equations (or jointly dependent variables).
It can be shown that when $B$ is nonsingular (10) holds if and only if the
rank of the matrix consisting of the columns of $(B \;\; \Gamma)$ with specified 0's in the
first row is $G - 1$.
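The order condition (12) is a simple counting rule and can be checked mechanically. The helper below is a hypothetical illustration (the function name and argument names are not from the text); it only tests the necessary counting condition, not the rank condition (10).

```python
def order_condition_holds(G, G1, K, K1):
    """Check the order condition G2 + K2 >= G - 1 for one equation.

    G  : number of jointly dependent variables (equations) in the system
    G1 : number of jointly dependent variables included in the equation
    K  : number of predetermined variables in the system
    K1 : number of predetermined variables included in the equation
    """
    G2 = G - G1          # excluded jointly dependent variables
    K2 = K - K1          # excluded predetermined variables
    return G2 + K2 >= G - 1

# Demand-and-supply style example: two equations, two predetermined
# variables (say income and cost of raw materials); the demand equation
# includes both endogenous variables and excludes one predetermined one.
demand_ok = order_condition_holds(G=2, G1=2, K=2, K1=1)
# If no predetermined variable were excluded, the count would fail.
no_exclusion_ok = order_condition_holds(G=2, G1=2, K=2, K1=2)
```

Passing this check does not by itself identify the equation; the rank of $\Pi_{12}$ must still equal $G_1 - 1$.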


12.8.3. Estimation of the Reduced Form

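The model (3) is a typical multivariate regression model. The observations
are

(13)    $\begin{pmatrix} y_1 \\ z_1 \end{pmatrix}, \ldots, \begin{pmatrix} y_T \\ z_T \end{pmatrix}$.

The usual estimators of $\Pi$ and $\Omega$ (Section 8.2) are

(14)    $P = \sum_{t=1}^{T} y_t z_t' \left( \sum_{t=1}^{T} z_t z_t' \right)^{-1}$,

(15)    $\hat\Omega = \frac{1}{T} \sum_{t=1}^{T} (y_t - P z_t)(y_t - P z_t)'$.

These are maximum likelihood estimators if the $v_t$ are normal.
If the $z_t$ are exogenous (regardless of normality), then

(16)    $\mathcal{E} \operatorname{vec} P = \operatorname{vec} \Pi, \qquad \mathcal{C}(\operatorname{vec} P) = A^{-1} \otimes \Omega$,

where

(17)    $A = \sum_{t=1}^{T} z_t z_t'$

and $\operatorname{vec}(d_1, \ldots, d_m) = (d_1', \ldots, d_m')'$. If, furthermore, the $v_t$ are normal, then
$P$ is normal and $T\hat\Omega$ has the Wishart distribution with covariance matrix $\Omega$
and $T - K$ degrees of freedom.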
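The least squares computation for the reduced form can be sketched as follows. The sketch assumes the usual multivariate regression estimators (the coefficient matrix from the normal equations and the residual covariance with divisor $T$); the function name `estimate_reduced_form` and the row-per-observation data layout are assumptions made for the example.

```python
import numpy as np

def estimate_reduced_form(Y, Z):
    """Least squares estimates for y_t = Pi z_t + v_t.

    Y : T x G array whose rows are y_t'
    Z : T x K array whose rows are z_t'
    Returns P (G x K), Omega_hat (G x G), and A = sum z_t z_t' (K x K).
    """
    T = Y.shape[0]
    A = Z.T @ Z                          # sum of z_t z_t'
    C = Y.T @ Z                          # sum of y_t z_t'
    P = np.linalg.solve(A, C.T).T        # P = C A^{-1}, using symmetry of A
    resid = Y - Z @ P.T                  # rows are (y_t - P z_t)'
    Omega_hat = resid.T @ resid / T
    return P, Omega_hat, A
```

Because the residuals are orthogonal to the rows of `Z`, the identity $T\hat\Omega = \sum_t y_t y_t' - PAP'$ holds for the output of this sketch.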
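12.8.4. Estimation of the Coefficients of an Equation

First consider the estimation of the vector of coefficients $\beta$ when $K_2 = G_1 - 1$. Let

(18)    $P = \begin{pmatrix} P_{11} & P_{12} \\ P_{21} & P_{22} \end{pmatrix}$

be partitioned as $\Pi$. Then the probability is 1 that $\operatorname{rank}(P_{12}) = G_1 - 1$ and
the equation

(19)    $\hat\beta' P_{12} = 0$

has a nontrivial solution that is unique except for a constant of proportionality.
This is the maximum likelihood estimator when the disturbance terms are
normal.

If $K_2 \ge G_1$, then the probability is 1 that $\operatorname{rank}(P_{12}) = G_1$ and (19) has only
the trivial solution $\hat\beta = 0$, which is unsatisfactory. To obtain a suitable
estimator we find $\hat\beta$ to minimize $\hat\beta' P_{12}$ in some sense relative to another
function of $\hat\beta$.

Let $z_t$ be partitioned into subvectors of $K_1$ and $K_2$ components:

(20)    $z_t = \begin{pmatrix} z_t^{(1)} \\ z_t^{(2)} \end{pmatrix}$.

Let $A$ be partitioned conformally,

(21)    $A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$,

(22)    $A_{22 \cdot 1} = A_{22} - A_{21} A_{11}^{-1} A_{12}$.

Let $y_t$ and $\Omega$ be partitioned into $G_1$ and $G_2$ components:

(23)    $y_t = \begin{pmatrix} y_t^{(1)} \\ y_t^{(2)} \end{pmatrix}$,

(24)    $\Omega = \begin{pmatrix} \Omega_{11} & \Omega_{12} \\ \Omega_{21} & \Omega_{22} \end{pmatrix}$.
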
Now set up the multivariate analysis of variance table for $y^{(1)}$:

Source                        Sum of Squares

$z^{(1)}$                     $\sum_{s,t=1}^{T} y_s^{(1)} z_s^{(1)\prime} A_{11}^{-1} z_t^{(1)} y_t^{(1)\prime}$

$z^{(2)} \perp z^{(1)}$       $P_{12} A_{22 \cdot 1} P_{12}'$

Error                         $\sum_{t=1}^{T} (y_t^{(1)} - P_{11} z_t^{(1)} - P_{12} z_t^{(2)})(y_t^{(1)} - P_{11} z_t^{(1)} - P_{12} z_t^{(2)})'$

Total                         $\sum_{t=1}^{T} y_t^{(1)} y_t^{(1)\prime}$

The first term in the table is the (vector) sum of squares of $y_t^{(1)}$ due to the
effect of $z_t^{(1)}$. The second term is due to the effect of $z_t^{(2)}$ beyond the effect of
$z_t^{(1)}$. The two add to $(PAP')_{11}$, which is the total effect of $z_t$, the predetermined
variables.

We propose to find the vector $\beta$ such that the effect of $z_t^{(2)}$ on $\beta' y_t^{(1)}$ beyond
the effect of $z_t^{(1)}$ is minimized relative to the error sum of squares of $\beta' y_t^{(1)}$.
We minimize


V ^ ) a A A A A A ， 

pn u p pn u p 

where $T\hat\Omega = \sum_{t=1}^{T} y_t y_t' - PAP'$. This estimator has been called the least variance
ratio estimator. Under normality and based only on the 0 restrictions on
the coefficients of this single equation, the estimator is maximum likelihood
and is known as the limited-information maximum likelihood (LIML) estimator
[Anderson and Rubin (1949)].

The algebra of minimizing (25) is to find the smallest root, say $\nu$, of

(26) 1^12^22 1^12 ^ An u | = 0 

and the corresponding vector satisfying 

(27) ^12^22 1^12 P = 

The vector is normalized according to some rule. A frequently used rule is to
set one (nonzero) coefficient equal to 1, say the first, $\hat\beta_1 = 1$. If we write


(28) 


(29) 


(30) 





= 




n, 


u 


(i) 


w d) 


then (27) can be replaced by the linear equation 

(31) (/ >f 2 S 22 ，/ >,V - ^r,)P*= -(^522,^2- ^ (1) )- 

The first component equation in (27) has been dropped because it is linearly
dependent on the other equations [because $\nu$ is a root of (26)].
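The least variance ratio computation can be sketched numerically. The following is an illustration under stated assumptions, not a transcription of (25) to (27): the denominator matrix is taken to be $\hat\Omega_{11}$ with divisor $T$ (a different scaling of the denominator rescales the root $\nu$ but leaves the normalized vector unchanged), and the function name `liml_first_equation` and its data layout are hypothetical.

```python
import numpy as np

def liml_first_equation(Y1, Z1, Z2):
    """Least variance ratio (LIML-type) estimate for one equation.

    Y1 : T x G1 included jointly dependent variables
    Z1 : T x K1 included predetermined variables
    Z2 : T x K2 excluded predetermined variables
    Returns (nu, beta, H, Omega11) with beta normalized so beta[0] == 1.
    """
    T = Y1.shape[0]
    K1 = Z1.shape[1]
    Z = np.hstack([Z1, Z2])
    A = Z.T @ Z
    P = np.linalg.solve(A, Z.T @ Y1).T            # rows of P for y^(1): (P11 P12)
    P12 = P[:, K1:]
    A11, A12, A22 = A[:K1, :K1], A[:K1, K1:], A[K1:, K1:]
    A22_1 = A22 - A12.T @ np.linalg.solve(A11, A12)
    H = P12 @ A22_1 @ P12.T                       # effect of z^(2) beyond z^(1)
    resid = Y1 - Z @ P.T
    Omega11 = resid.T @ resid / T                 # assumed scaling of the error matrix
    # Solve H b = nu * Omega11 b through a Cholesky reduction to a
    # symmetric eigenproblem; eigh returns roots in increasing order.
    L = np.linalg.cholesky(Omega11)
    L_inv = np.linalg.inv(L)
    vals, vecs = np.linalg.eigh(L_inv @ H @ L_inv.T)
    nu = vals[0]
    b = L_inv.T @ vecs[:, 0]
    return nu, b / b[0], H, Omega11
```

The returned vector minimizes the ratio $b'Hb / b'\hat\Omega_{11}b$, so the ratio evaluated at any other vector is at least $\nu$.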


12.8.5. Relation to the Linear Functional Relationship

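We now show that the model for the single linear functional relationship
($q = 1$) is identical to the model for structural equations in the special case
that $G_2 = 0$ ($y_t^{(1)} = y_t$) and $z_t^{(1)} \equiv 1$ ($K_1 = 1$). Write the two models as

(32)    $X_{\alpha j} = \mu + \nu_\alpha + U_{\alpha j}, \qquad \alpha = 1, \ldots, n, \quad j = 1, \ldots, k$,

where

(33)    $\sum_{\alpha=1}^{n} \nu_\alpha = 0$,

and

(34)    $y_t = \Pi_1 + \Pi_2 z_t^{(2)} + v_t, \qquad t = 1, \ldots, T$,

where $\Pi = (\Pi_1 \;\; \Pi_2)$. The correspondence between the models is $p \leftrightarrow G = G_1$,

(35)    $X_{\alpha j} \leftrightarrow y_t, \qquad U_{\alpha j} \leftrightarrow v_t$,

(36)    $(\alpha, j) \leftrightarrow t, \qquad nk \leftrightarrow T$,

(37)    $\Psi \leftrightarrow \Omega, \qquad \mu \leftrightarrow \Pi_1$.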

We can write the model for the linear functional relationship with dummy 
variables. Define


(38) 


(39) 


Then 


0\ 


S ai 


<-ath position, 




(40) 


P ( jt ， v 卜 …， v , 叫) 卜⑺ 


a — 1 ， . •. ， /z — 1 , 


ct — 1 ， … ， n ， 


where $j$ may be suppressed. Note

(41)    $\nu_n = -(\nu_1 + \cdots + \nu_{n-1})$.

The correspondence is


(42) 

1 « 

, 2 (1) 

% 


(43) 

(JL 

^ » * * * 


<*■» n 2 , 

(44) 

1 e 

义， 

n - \ 


(45) 

fi ( V |， …， V ,i-|)= 

Oh p’n 2 = o. 




Let $P = (P_1 \;\; P_2)$. In terms of the statistics we have the correspondence


(46) \i = x^y, 

(47) v a =x a -x^P 2 . 

The effect matrix is

(48)    $H = k \sum_{\alpha=1}^{n} (\bar x_\alpha - \bar x)(\bar x_\alpha - \bar x)' \leftrightarrow P_2 A_{22 \cdot 1} P_2'$,

and the error matrix is

(49)    $G = \sum_{\alpha=1}^{n} \sum_{j=1}^{k} (x_{\alpha j} - \bar x_\alpha)(x_{\alpha j} - \bar x_\alpha)' \leftrightarrow \sum_{t=1}^{T} (y_t - P z_t)(y_t - P z_t)'$.

Then the estimator $\hat B$ of the linear functional relationship for $q = 1$ is
identical to the LIML estimator [Anderson (1951b), (1976), (1984a)].


12.8.6. Asymptotic Theory as $T \to \infty$

We shall find the limiting distribution of $\sqrt{T}(\hat\beta^* - \beta^*)$ defined by (28) and
(31) by showing that $\hat\beta^*$ is asymptotically equivalent to

(50) PireLS = "^ ( *^22 • i ^*2 ) 1 ^22 ^22 * 1 P 12 • 

This derivation is essentially the same as that given in Anderson and Rubin
(1950) except for notation. The estimator defined by (50), known as the
two-stage least squares (TSLS) estimator, is an approximation to the LIML
estimator obtained by dropping the terms $\nu\hat\Omega_{11}^*$ and $\nu\hat\omega_{(1)}$ from (31). Let
$\hat\beta^* = \hat\beta^*_{\mathrm{LIML}}$. We assume the conditions for $\sqrt{T}(P - \Pi)$ having a limiting
normal distribution. (See Theorem 8.11.1.)
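The relation between the two estimators can be made concrete with a short sketch. It assumes the normalization in which the first coefficient equals 1 and writes $H$ for the effect matrix $P_{12}A_{22\cdot 1}P_{12}'$; the function name `normalized_coefficients` is hypothetical, and the scaling of the error matrix is an assumption that only affects the numerical value of $\nu$. Setting $\nu = 0$ drops the two $\nu$ terms and gives the TSLS-type solution.

```python
import numpy as np

def normalized_coefficients(H, Omega11, nu):
    """Solve the linear system for the remaining coefficients.

    With the first coefficient fixed at 1, rows 2, ..., G1 of
    (H - nu * Omega11) b = 0 give
        (H* - nu Omega*) beta* = -(h - nu omega),
    where H*, Omega* drop the first row and column and h, omega are the
    first columns without their first entry. nu = 0 yields the
    TSLS-type solution; nu equal to the smallest root yields the
    LIML-type solution.
    """
    lhs = H[1:, 1:] - nu * Omega11[1:, 1:]
    rhs = -(H[1:, 0] - nu * Omega11[1:, 0])
    return np.linalg.solve(lhs, rhs)
```

For small $\nu$ the two solutions differ by a term proportional to $\nu$, which is the mechanism behind the asymptotic equivalence discussed in this subsection.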

Lemma 12.8.1. Suppose $(1/T)A \to A^0$, a positive definite matrix, as $T \to \infty$.
Then $\nu = O_p(1/T)$, where $\nu$ is the smallest root of (26).

Proof. Let $P_{12}^* = \sqrt{T}(P_{12} - \Pi_{12})$. Then because $\beta'\Pi_{12} = 0$
m、_ P'[ni 2 + + ( 1 / 厅)户 12] P 

( 3 丄 ) A 一 A 

P^nP P^nP 

„ P ^12^22 -12 P ^ Q ( 

…— rpfinp ——"⑺’ 

. P ^12^22*1^12^ P ^ 12 ^22 • I ^12 P 

nn --—--<- 7 - 

P pn u p P^uP 


Since 

(52) 


the lemma follows. ■ 
The import of Lemma 12.8.1 is that the difference between the LIML
estimator and the TSLS estimator is $O_p(1/T)$. We have


(53) PtiML - Ptsls 


~ ( ^[2 ^22 > 1 ^1*') ^22 • 1 Pl2 


^ {^ 12 ^ 22*)^12 — ) ( ^\2 ^22 *! P \2 " 

[(^1*2^22-1^12) 1 ~ (d$22 ， ld_ ^^ll) ^12^22 ^\P\? 


■*" ( ^12 ^22 * I ^12 — ^^ll) ^^(1) 

口 — v i ^I2^22*\^n) 1 ^11(^1*2^22*1 ^*2 ^ ) ^I* ^22 • 1 Pl2 

+ ^(^ 1 * 2 ^ 22 * 1^12 ^ V ^n) ^( 1 ) 

= O p (v) = O p (y). 

Consider 


(54) / I2 +/»^P* = /»； 2 P=/1 2 - 2 1 . 1 : = d Y ： z^u Ui 

/-I / 


where z^^z^-A^A^z^. Thus A/»i 2 + ^*2 P*) = /尸; 2 P = 0 and 

(55) S{p\ 2 + ^ 2 ^*)(/, 2 + ^P*)，= 5P ( 尸 ; 2 P )、 

Note that P^sls — P* = —( 尸 iH 2 .!d) 1 P\ 2 ^ 22 • 1 ■^ > i ， 2 P an< ^ (P’，0) 兄 + 
{y'^)z t = u u . 

Theorem 12.8.1. Under the conditions of Theorem 8.11.1 

(56) ^(PtiML - P*) - ^[o,c7 n (n 12 nn， 12 )-r. 

Proof. The theorem follows from (55)，S 22>1 -> and JP l2 上 II I2 , 


Because of the correspondence between the LIML estimator and the
maximum likelihood estimator for the linear functional relationship as outlined
in Section 12.8.5, this asymptotic theory can be translated for the latter.
Suppose the single linear functional relationship is written as

/ 

(57) 0=PX = (1 p*’）Z «= 

\ lt f 


where 

(58) 




Let ai (<-> K) be fixed, and let the number of replications k ^ (correspond¬ 
ing to 7y/C — a for fixed K). Let a 2 = (5^(5. 

Since n i2 4 22 in' 12 corresponds to (5* here has the approxi¬ 

mate distribution 


(59) N 

Although Anderson and Rubin (1950) showed that $\nu\hat\Omega_{11}^*$ and $\nu\hat\omega_{(1)}$ could
be dropped from (31) defining $\hat\beta^*_{\mathrm{LIML}}$ and hence that $\hat\beta^*_{\mathrm{TSLS}}$ was asymptotically
equivalent to $\hat\beta^*_{\mathrm{LIML}}$, they did not explicitly propose $\hat\beta^*_{\mathrm{TSLS}}$. [As part of
the Cowles Commission program, Chernoff and Divinsky (1953) developed a
computational program of $\hat\beta_{\mathrm{LIML}}$.] The TSLS estimator was proposed by
Basmann (1957) and Theil (1960). It corresponds in the linear functional
relationship setup to ordinary least squares on the first coordinate. If some
other coefficient of $\beta$ were set equal to one, the minimization would be in
the direction of that coordinate.

Consider the general linear functional relationship when the error covari¬ 
ance matrix is unknown and there are replications. Constrain B to be 


\ a<= 1 


(60) B*) t 

Partition 


(61) 




V 


( 1 ) 


V 


( 2 ) 


Then the least squares estimator of B* is 

(62) - E ( 却 )—i ⑴ )« 2) —f ： ( 巧 )— 无 (2) )0L 2) -i (2) ) 

<»*- I <lf— I 
For $n$ fixed and $k \to \infty$, $\hat B^*_{\mathrm{LS}} \to_p B^*$ and

n 1 

(63) }/k\ec(B^~ B*)^N 0, £< 2 〜又 2), ®B^B '. 

\ cy* I / 

[See Anderson (1984b).] It was shown by Anderson (1951c) that the $q$
smallest sample roots are of such a probability order that the maximum
likelihood estimator is asymptotically equivalent, that is, the limiting distribution
of $\sqrt{k}\,\operatorname{vec}(\hat B^*_{\mathrm{ML}} - B^*)$ is the right-hand side of (63).


12.8.7. Other Asymptotic Theory 


In terms of the linear functional relationship it may be more natural to
consider $n \to \infty$ and $k$ fixed. When $k = 1$ and the error covariance matrix is
$\sigma^2 I_p$, Gleser (1981) has given the asymptotic theory. For the simultaneous
equations model, the corresponding conditions are that $K_2 \to \infty$, $T \to \infty$, and
$K_2/T$ approaches a positive limit. Kunitomo (1980) has given an asymptotic
expansion of the distribution in the case of $p = 2$ and $m = q = 1$.

When $n \to \infty$, the least squares estimator (i.e., minimizing the sum of
squares of the residuals in one fixed direction) is not consistent; the LIML
and TSLS estimators are not asymptotically equivalent.

12.8.8. Distributions of Estimators

Econometricians have studied intensively the distributions of TSLS and
LIML estimators, particularly in the case of two endogenous variables.
Exact distributions have been given by Basmann (1961), (1963), Richardson
(1968), Sawa (1969), Mariano and Sawa (1972), Phillips (1980), and Anderson
and Sawa (1982). These have not been very informative because they are
usually given in terms of infinite series the properties of which are unknown
or irrelevant.

A more useful approach is by approximating the distributions. Asymptotic
expansions of distributions have been made by Sargan and Mikhail (1971),
Anderson and Sawa (1973), Anderson (1974), Kunitomo (1980), and others.
Phillips (1982) studied the Padé approach. See also Anderson (1977).

Tables of the distributions of the TSLS and LIML estimators in the case
of two endogenous variables have been given by Anderson and Sawa
(1977), (1979), and Anderson, Kunitomo, and Sawa (1983a).

Anderson, Kunitomo, and Sawa (1983b) graphed densities of the maximum
likelihood estimator and the least squares estimator (minimizing in one
direction) for the linear functional relationship (Section 12.6) for the case
$p = 2$, $m = q = 1$, $\Psi = \sigma^2 \Psi_0$, and for various values of $\beta$, $n$, and
(64) 5 2 == -iy ^ 

l 

PROBLEMS 

12.1. (Sec, 12-2) Letz a = Z\ a = 1, a = 1 ， …， n, and P = p. Verify that a (1) = 2 -1 p. 
Relate this result to the discriminant function (Chapter 6 ). 

12.2. (Sec. 12.2) Prove that the roots of (14) are real.

12.3. (Sec. 12.2) 

(a) Let = 

U = V -7*X (2 \ <SU 2 = 1 = SV 2 ^ where a and y are vectors 

Show that choosing a and 7 to maximize <SW is equivalent to choosing 
a and 7 to minimize the generalized variance of (U V\ 

(b) Let X 1 X (2), X (3)t \ ^X-0, 



%! 


Si,] 

S XX' = 2 == 


2 22 

^23 


1^, 

^32 

^33 j 


U = ot , X( 1 ) ， V^y f X (Zi \ \V= p r X (3) , SU 2 = <SV 2 -SW 2 ^\. Consider 
finding nP to minimize the generalized variance of ((/, V^W). Show 
that this minimum is invariant with respect to transformations = 
^ 0 . 

(c) By using such transformations, transform £ into the simplest possible 
form, 

(d) In the case of X (i) consisting of two components, reduce the problem (of 
minimizing the generalized variance) to its simplest form. 

(e) In this case give the derivative equations. 

If) Show that the minimum generalized variance is 1 if and only if 2 12 - 0, 
2 13 = 0, 2^23 = 0. {Noie: This extension of the notion of canonical variates 
does not lend itself to a “nice” explicit treatment.) 

12.4. (Sec 12.2) Let 

x (,) =^z + r (,) , 

A f(5) -BZ + y (2) , 
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where Y^\ 2 are independent with mean zero and covariance matrices I 
with appropriate dimensionalities. Let A^(a [y ^.,a k \ B = (b ly ., ty b k \ and 
suppose that A'A, B f B are diagonal with positive diagonal elements. Show 
that the canonical variables for nonzero canonical correlations are propor¬ 
tional to Obtain the canonical aDrrelation coefficients and ap¬ 

propriate normalizing coefficients for the canonical variables. 

12.5. (Sec. 12.2) Let \ 艺入 2 艺 … > > 0 be the positive roots of (14)，where 2 u 

and 2 22 are qXq nonsingular matrices. 

(a) What is the rank of 2 12 ? 

(b) Write as the determinant of a rational function of 2 U ， 2 I2 , 2 2l ， 

and 2 22 ^ Justify your answer. 

(c) If k q ^ 1, what is the rank of 

^ ^21 

12.6. (Sec. 12,2) Let 2 U = (1 +ge /) e' p ^ l 22 = (\-h)l p3 +he p e' pi , S 12 == 

ke pi e r p2 y where 一 l/(p] — 0 <g < 1 ， —l/(p 2 - 1) </i < l» and k is suitably 
restricted Find the canonical correlations and variates. What is the appropri¬ 
ate restriction on fe? 

12.7. (Sec. 12.3) Find the canonical correlations and canonical variates between
the first two variables and the last three in Problem 4.42.

12.8. (Sec. 12.3) Prove directly the sample analog of Theorem 12.2.1.

12.9. (Sec. 12.3) Prove that A^(/ -h 1) -> Aj and a (/ + 1) -> ot ⑴ if a(0) is such that 
a f (0)2 u a (n #0. [ Him : Vr 2 2n 2 I2 222 X 2i = A A 2 A~ *.] 

12.10. (Sec. 12.6) Prove (9), (10), and (11).

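12.11. Let $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_q$ be the roots of $|\Sigma_1 - \lambda\Sigma_2| = 0$, where $\Sigma_1$ and $\Sigma_2$ are
$q \times q$ positive definite covariance matrices.

(a) What does $\lambda_1 = \lambda_q = 1$ imply about the relationship of $\Sigma_1$ and $\Sigma_2$?

(b) What does $\lambda_q > 1$ imply about the relationships of the ellipsoids
$x'\Sigma_1^{-1}x = c$ and $x'\Sigma_2^{-1}x = c$?

(c) What does $\lambda_1 > 1$ and $\lambda_q < 1$ imply about the relationships of the ellipsoids
$x'\Sigma_1^{-1}x = c$ and $x'\Sigma_2^{-1}x = c$?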

12.12. (Sec. 12.4) For $q = 2$ express the criterion (2) of Section 9.5 in terms of
canonical correlations.

12.13. Find the canonical correlations for the data in Problem 9.11. 



CHAPTER 13 


The Distributions of Characteristic 
Roots and Vectors 


13.1. INTRODUCTION

In this chapter we find the distribution of the sample principal component
vectors and their sample variances when all population variances are 1
(Section 13.3). We also find the distribution of the sample canonical correlations
and one set of canonical vectors when the two sets of original variates
are independent. This second distribution will be shown to be equivalent to
the distribution of roots and vectors obtained in the next section. The
distribution of the roots is particularly of interest because many invariant
tests are functions of these roots. For example, invariant tests of the general
linear hypothesis (Section 8.6) depend on the sample only through the roots
of the determinantal equation

(1) I (Pm ~^)A U . 2 (^ -Pty- win| = 0. 

If the hypothesis is true, the roots have the distribution given in Theorem
13.2.2 or 13.2.3. Thus the significance level of any invariant test of the
general linear hypothesis can be obtained from the distribution derived in the
next section. If the test criterion is one of the ordered roots (e.g., the largest
root), then the desired distribution is a marginal distribution of the joint
distribution of roots.

The limiting distributions of the roots are obtained under fairly general
conditions. These are needed to obtain other limiting distributions, such as
the distribution of the criterion for testing that the smallest variances of
principal components are equal. Some limiting distributions are obtained for
elliptically contoured distributions.

13.2. THE CASE OF TWO WISHART MATRICES 
13.2.1. The Transformation 

Let us consider $A^*$ and $B^*$ ($p \times p$) distributed independently according to
$W(\Sigma, m)$ and $W(\Sigma, n)$ respectively ($m, n > p$). We shall call the roots of

(1)    $|A^* - lB^*| = 0$

the characteristic roots of $A^*$ in the metric of $B^*$ and the vectors satisfying

(2)    $(A^* - lB^*)x^* = 0$

the characteristic vectors of $A^*$ in the metric of $B^*$. In this section we shall
consider the distribution of these roots and vectors. Later it will be shown
that the squares of canonical correlation coefficients have this distribution if
the population canonical correlations are all zero.

First we shall transform $A^*$ and $B^*$ so that the distributions do not
involve an arbitrary matrix $\Sigma$. Let $C$ be a matrix such that $C\Sigma C' = I$. Let

(3)    $A = CA^*C', \qquad B = CB^*C'$.

Then $A$ and $B$ are independently distributed according to $W(I, m)$ and
$W(I, n)$ respectively (Section 7.3.3). Since

$|A - lB| = |CA^*C' - lCB^*C'|$
$\qquad\quad = |C(A^* - lB^*)C'| = |C| \cdot |A^* - lB^*| \cdot |C'|$,

the roots of (1) are the roots of

(4)    $|A - lB| = 0$.

The corresponding vectors satisfying

(5)    $(A - lB)x = 0$

satisfy

(6)    $0 = C^{-1}(A - lB)x$
$\qquad\; = C^{-1}(CA^*C' - lCB^*C')x$
$\qquad\; = (A^* - lB^*)C'x$.

Thus the vectors $x^*$ are the vectors $C'x$.
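The invariance of the roots under the transformation (3) can be checked numerically. The sketch below is only an illustration: the covariance matrix `Sigma`, the dimensions, and the helper `metric_roots` are invented for the example, and $C$ is taken to be the inverse of the Cholesky factor of $\Sigma$, which is one of many matrices satisfying $C\Sigma C' = I$.

```python
import numpy as np

rng = np.random.default_rng(0)
p, m, n = 3, 6, 8
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
L_chol = np.linalg.cholesky(Sigma)          # Sigma = L L'

# Wishart-type matrices A* and B* built from normal samples.
Xa = L_chol @ rng.standard_normal((p, m))
Xb = L_chol @ rng.standard_normal((p, n))
A_star = Xa @ Xa.T
B_star = Xb @ Xb.T

C = np.linalg.inv(L_chol)                   # then C Sigma C' = I
A = C @ A_star @ C.T
B = C @ B_star @ C.T

def metric_roots(A, B):
    """Roots of |A - l B| = 0, sorted in decreasing order."""
    vals = np.linalg.eigvals(np.linalg.solve(B, A))
    return np.sort(vals.real)[::-1]

roots_star = metric_roots(A_star, B_star)
roots = metric_roots(A, B)

# A vector x for (A, B) maps to x* = C'x for (A*, B*).
vals, vecs = np.linalg.eig(np.linalg.solve(B, A))
l0 = vals[0].real
x = vecs[:, 0].real
x_star = C.T @ x
```

The two sets of roots agree up to rounding, and `x_star` satisfies $(A^* - l_0 B^*)x^* = 0$.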
It will be convenient to consider the roots of

(7)    $|A - f(A + B)| = 0$

and the vectors $y$ satisfying

(8)    $[A - f(A + B)]y = 0$.

The latter equation can be written

(9)    $0 = (A - fA - fB)y = [(1 - f)A - fB]y$.

Since the probability that $f = 1$ (i.e., that $|-B| = 0$) is 0, the above equation
is

(10)    $\left[ A - \frac{f}{1 - f} B \right] y = 0$.

Thus the roots of (4) are related to the roots of (7) by $l = f/(1 - f)$ or
$f = l/(1 + l)$, and the vectors satisfying (5) are equal (or proportional) to
those satisfying (8).

We now consider finding the distribution of the roots and vectors satisfying
(7) and (8). Let the roots be ordered $f_1 > f_2 > \cdots > f_p > 0$ since the
probability of two roots being equal is 0 [Okamoto (1973)]. Let

(11)    $F = \begin{pmatrix} f_1 & 0 & \cdots & 0 \\ 0 & f_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & f_p \end{pmatrix}$.

Suppose the corresponding vector solutions of (8) normalized by

(12)    $y'(A + B)y = 1$

are $y_1, \ldots, y_p$. These vectors must satisfy

(13)    $y_i'(A + B)y_j = 0, \qquad i \ne j$,

because $y_i'Ay_j = f_j\, y_i'(A + B)y_j$ and $y_i'Ay_j = f_i\, y_i'(A + B)y_j$, and this can be only
if (13) holds ($f_i \ne f_j$).

Let the $p \times p$ matrix $Y$ be

(14)    $Y = (y_1, \ldots, y_p)$.

Equation (8) can be summarized as

(15)    $AY = (A + B)YF$,

and (12) and (13) give

(16)    $Y'(A + B)Y = I$.

From (15) we have

(17)    $Y'AY = Y'(A + B)YF = F$.

Multiplication of (16) and (17) on the left by $(Y')^{-1}$ and on the right by $Y^{-1}$
gives

(18)    $A + B = (Y')^{-1}Y^{-1}, \qquad A = (Y')^{-1}FY^{-1}$.

Now let $Y^{-1} = E$. Then

(19)    $A + B = E'E, \qquad A = E'FE, \qquad B = E'(I - F)E$.

We now consider the joint distribution of $E$ and $F$. From (19) we see that
$E$ and $F$ define $A$ and $B$ uniquely. From (7) and (11) and the ordering
$f_1 > \cdots > f_p$ we see that $A$ and $B$ define $F$ uniquely. Equations (8) for $f = f_i$
and (12) define $y_i$ uniquely except for multiplication by $-1$ (i.e., replacing $y_i$
by $-y_i$). Since $YE = I$, this means that $E$ is defined uniquely except that rows
of $E$ can be multiplied by $-1$. To remove this indeterminacy we require that
$e_{i1} \ge 0$. (The probability that $e_{i1} = 0$ is 0.) Thus $E$ and $F$ are uniquely defined
in terms of $A$ and $B$.
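The passage from $(A, B)$ to $(E, F)$ can be carried out numerically. The following sketch is one possible construction, assumed for illustration rather than taken from the text: it reduces (8) to a symmetric characteristic root problem through a Cholesky factor of $A + B$. The function name `decompose` is hypothetical.

```python
import numpy as np

def decompose(A, B):
    """Return (E, F) with A = E'FE, B = E'(I - F)E, and e_{i1} >= 0.

    F is diagonal with roots f_1 > ... > f_p of |A - f(A + B)| = 0.
    """
    L = np.linalg.cholesky(A + B)              # A + B = L L'
    L_inv = np.linalg.inv(L)
    f, V = np.linalg.eigh(L_inv @ A @ L_inv.T)
    order = np.argsort(f)[::-1]                # decreasing roots
    f, V = f[order], V[:, order]
    # Y = L'^{-1} V satisfies Y'(A + B)Y = I, so E = Y^{-1} = V'L'.
    E = V.T @ L.T
    signs = np.where(E[:, 0] < 0, -1.0, 1.0)   # fix the sign of each row
    E = signs[:, None] * E
    return E, np.diag(f)
```

Flipping the sign of a row of $E$ leaves $E'FE$ and $E'E$ unchanged because $F$ is diagonal, which is why the sign convention on the first column removes the only remaining indeterminacy.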


13.2.2. The Jacobian 


To find the density of $E$ and $F$ we substitute in the density of $A$ and $B$
according to (19) and multiply by the Jacobian of the transformation. We
devote this subsection to finding the Jacobian

(20)    $\left| \dfrac{\partial(A, B)}{\partial(E, F)} \right|$.

Since the transformation from $A$ and $B$ to $A$ and $G = A + B$ has Jacobian
unity, we shall find

(21)    $\left| \dfrac{\partial(A, G)}{\partial(E, F)} \right| = \left| \dfrac{\partial(A, B)}{\partial(E, F)} \right|$.
First we notice that if $x_\alpha = f_\alpha(y_1, \ldots, y_n)$, $\alpha = 1, \ldots, n$, is a one-to-one
transformation, the Jacobian is the determinant of the linear transformation

(22)    $dx_\alpha = \sum_{\beta} \dfrac{\partial f_\alpha}{\partial y_\beta}\, dy_\beta$,

where $dx_\alpha$ and $dy_\beta$ are only formally differentials (i.e., we write these as a
mnemonic device). If $f_\alpha(y_1, \ldots, y_n)$ is a polynomial, then $\partial f_\alpha / \partial y_\beta$ is the
coefficient of $y_\beta^*$ in the expansion of $f_\alpha(y_1 + y_1^*, \ldots, y_n + y_n^*)$ [in fact the
coefficient in the expansion of $f_\alpha(y_1, \ldots, y_\beta + y_\beta^*, \ldots, y_n)$].
The elements of $A$ and $G$ are polynomials in $E$ and $F$. Thus the derivative of
an element of $A$ is the coefficient of an element of $E^*$ and $F^*$ in the
expansion of $(E + E^*)'(F + F^*)(E + E^*)$ and the derivative of an element of
$G$ is the coefficient of an element of $E^*$ and $F^*$ in the expansion of
$(E + E^*)'(E + E^*)$. Thus the Jacobian of the transformation from $A, G$
to $E, F$ is the determinant of the linear transformation

(23)    $dA = (dE)'FE + E'(dF)E + E'F(dE)$,

(24)    $dG = (dE)'E + E'(dE)$.

Since $A$ and $G$ ($dA$ and $dG$) are symmetric, only the functionally independent
component equations above are used.

Multiply (23) and (24) on the left by $E'^{-1}$ and on the right by $E^{-1}$ to
obtain

(25)    $E'^{-1}(dA)E^{-1} = E'^{-1}(dE)'F + dF + F(dE)E^{-1}$,

(26)    $E'^{-1}(dG)E^{-1} = E'^{-1}(dE)' + (dE)E^{-1}$.

It should be kept in mind that (23) and (24) are now considered as a linear
transformation without regard to how the equations were obtained.

Let

(27)    $E'^{-1}(dA)E^{-1} = d\bar A$,

(28)    $E'^{-1}(dG)E^{-1} = d\bar G$,

(29)    $(dE)E^{-1} = dW$.

Then

(30)    $d\bar A = (dW)'F + dF + F(dW)$,

(31)    $d\bar G = dW' + dW$.
The linear transformation from $dE, dF$ to $dA, dG$ is considered as the linear
transformation from $dE, dF$ to $dW, dF$ with determinant $|E^{-1}|^p = |E|^{-p}$
(because each row of $dE$ is transformed by $E^{-1}$), followed by the linear
transformation from $dW, dF$ to $d\bar A, d\bar G$, followed by the linear transformation
from $d\bar A, d\bar G$ to $dA = E'(d\bar A)E$, $dG = E'(d\bar G)E$ with determinant $|E|^{p+1} \cdot |E|^{p+1}$
(from Section 7.3.3); and the determinant of the linear transformation
from $dE, dF$ to $dA, dG$ is the product of the determinants of the three
component transformations. The transformation (30), (31) is written in components as

(32)    $d\bar a_{ii} = df_i + 2f_i\, dw_{ii}$,
$\qquad\;\; d\bar a_{ij} = f_j\, dw_{ji} + f_i\, dw_{ij}, \qquad i < j$,

(33)    $d\bar g_{ii} = 2\, dw_{ii}$,
$\qquad\;\; d\bar g_{ij} = dw_{ji} + dw_{ij}, \qquad i < j$.

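The determinant is

(34)    $\begin{vmatrix} I & 2F & 0 & 0 \\ 0 & 2I & 0 & 0 \\ 0 & 0 & M & N \\ 0 & 0 & I & I \end{vmatrix} = 2^p\, |M - N|$,

where the rows correspond to $d\bar a_{ii}$, $d\bar g_{ii}$, $d\bar a_{ij}$ ($i < j$), and $d\bar g_{ij}$ ($i < j$), the
columns correspond to $df_i$, $dw_{ii}$, $dw_{ij}$ ($i < j$), and $dw_{ji}$ ($i < j$), and

(35)    $M = \operatorname{diag}(f_1 I_{p-1},\, f_2 I_{p-2},\, \ldots,\, f_{p-1}), \qquad N = \operatorname{diag}(f_2, \ldots, f_p,\; f_3, \ldots, f_p,\; \ldots,\; f_p)$.

Then

(36)    $|M - N| = \prod_{i<j} (f_i - f_j)$.
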

The determinant of the linear transformation (23), (24) is

(37)    $|E|^{-p}\, |E|^{p+1}\, |E|^{p+1}\, 2^p \prod_{i<j} (f_i - f_j) = 2^p\, |E|^{p+2} \prod_{i<j} (f_i - f_j)$.

Theorem 13.2.1. The Jacobian of the transformation (19) is the absolute
value of (37).
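Theorem 13.2.1 can be checked numerically for small $p$. The sketch below is an illustration under stated assumptions: it differentiates the map from $(E, F)$ to the functionally independent (upper triangular) elements of $A = E'FE$ and $G = E'E$ by central differences and compares the absolute determinant with $2^p |E|^{p+2} \prod_{i<j}(f_i - f_j)$. All function names are hypothetical.

```python
import numpy as np

def image(theta, p):
    """Upper-triangular elements of A = E'FE and G = E'E as one vector."""
    E = theta[:p * p].reshape(p, p)
    F = np.diag(theta[p * p:])
    A = E.T @ F @ E
    G = E.T @ E
    iu = np.triu_indices(p)
    return np.concatenate([A[iu], G[iu]])

def numerical_jacobian_det(E, f, h=1e-5):
    """Absolute determinant of d(A, G)/d(E, F) by central differences."""
    p = len(f)
    theta = np.concatenate([E.ravel(), f])
    d = theta.size                       # p*p + p = p*(p + 1) coordinates
    J = np.empty((d, d))
    for k in range(d):
        step = np.zeros(d)
        step[k] = h
        J[:, k] = (image(theta + step, p) - image(theta - step, p)) / (2 * h)
    return abs(np.linalg.det(J))

def theorem_value(E, f):
    """2^p |E|^(p+2) prod_{i<j} (f_i - f_j) for decreasing f."""
    p = len(f)
    prod = 1.0
    for i in range(p):
        for j in range(i + 1, p):
            prod *= f[i] - f[j]
    return 2 ** p * abs(np.linalg.det(E)) ** (p + 2) * prod
```

Since each element of $A$ and $G$ is at most quadratic in any single coordinate, the central differences are exact up to rounding, so the two values should agree closely.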

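13.2.3. The Joint Distribution of the Matrix E and the Roots

The joint density of $A$ and $B$ is

(38)    $w(A \mid I, m)\, w(B \mid I, n) = C_1 |A|^{\frac12(m-p-1)} |B|^{\frac12(n-p-1)} e^{-\frac12 \operatorname{tr}(A+B)}$,

where

(39)    $C_1 = \left[ 2^{\frac12 p(m+n)}\, \Gamma_p(\tfrac12 m)\, \Gamma_p(\tfrac12 n) \right]^{-1}$.

Therefore the joint density of $E$ and $F$ is

(40)    $C_1 |E'FE|^{\frac12(m-p-1)} |E'(I - F)E|^{\frac12(n-p-1)} e^{-\frac12 \operatorname{tr} E'E}\, 2^p\, |E'E|^{\frac12(p+2)} \prod_{i<j} (f_i - f_j)$.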
Since $|E'FE| = |E'| \cdot |F| \cdot |E| = \prod_{i=1}^{p} f_i\, |E'E|$ and $|E'(I - F)E|
= |I - F| \cdot |E'E| = \prod_{i=1}^{p} (1 - f_i)\, |E'E|$, the density of $E$ and $F$ is

(41)    $2^p C_1 |E'E|^{\frac12(m+n-p)} e^{-\frac12 \operatorname{tr} E'E} \prod_{i=1}^{p} f_i^{\frac12(m-p-1)} \prod_{i=1}^{p} (1 - f_i)^{\frac12(n-p-1)} \prod_{i<j} (f_i - f_j)$.

Clearly, $E$ and $F$ are statistically independent because the density factors
into a function of $E$ and a function of $F$. To determine the marginal
densities we have only to find the two normalizing constants (the product of
which is $2^p C_1$).

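Let us evaluate

(42)    $2^p \int |E'E|^{\frac12(m+n-p)} e^{-\frac12 \operatorname{tr} E'E}\, dE$,

where the integration is $0 < e_{i1} < \infty$, $-\infty < e_{ij} < \infty$, $j \ne 1$. The value of (42) is
unchanged if we let $-\infty < e_{i1} < \infty$ and multiply by $2^{-p}$. Thus (42) is

(43)    $(2\pi)^{\frac12 p^2} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} |E'E|^{\frac12(m+n-p)} \left[ \dfrac{1}{(2\pi)^{\frac12 p^2}} \exp\left( -\tfrac12 \sum_{i,j} e_{ij}^2 \right) \right] \prod_{i,j} de_{ij}$.

Except for the constant $(2\pi)^{\frac12 p^2}$, (43) is a definition of the expectation of the
$\frac12(m + n - p)$th power of $|E'E|$ when the $e_{ij}$ have as density the function
within brackets. This expected value is the $\frac12(m + n - p)$th moment of the
generalized variance $|E'E|$ when $E'E$ has the distribution $W(I, p)$. (See
Section 7.5.) Thus (43) is

(44)    $(2\pi)^{\frac12 p^2}\, 2^{\frac12 p(m+n-p)}\, \dfrac{\Gamma_p[\frac12(m+n)]}{\Gamma_p(\frac12 p)}$.

Thus the density of $E$ is

(45)    $\dfrac{\Gamma_p(\frac12 p)}{2^{\frac12 p(m+n-2)}\, \pi^{\frac12 p^2}\, \Gamma_p[\frac12(m+n)]}\, |E'E|^{\frac12(m+n-p)} e^{-\frac12 \operatorname{tr} E'E}$.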

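The density of $f_i$ is (41) divided by (45); that is, the density of $f_i$ is

(46)    $C_2 \prod_{i=1}^{p} f_i^{\frac12(m-p-1)} \prod_{i=1}^{p} (1 - f_i)^{\frac12(n-p-1)} \prod_{i<j} (f_i - f_j)$

for $0 \le f_p \le \cdots \le f_1 \le 1$, where

(47)    $C_2 = \dfrac{\pi^{\frac12 p^2}\, \Gamma_p[\frac12(m+n)]}{\Gamma_p(\frac12 n)\, \Gamma_p(\frac12 m)\, \Gamma_p(\frac12 p)}$.

The density of $l_i$ is obtained from (46) by letting

(48)    $f_i = \dfrac{l_i}{l_i + 1}$;

we have

(49)    $\dfrac{df_i}{dl_i} = \dfrac{1}{(l_i + 1)^2}, \qquad f_i - f_j = \dfrac{l_i - l_j}{(l_i + 1)(l_j + 1)}, \qquad 1 - f_i = \dfrac{1}{l_i + 1}$.

Thus the density of $l_i$ is

(50)    $C_2 \prod_{i=1}^{p} l_i^{\frac12(m-p-1)} \prod_{i=1}^{p} (l_i + 1)^{-\frac12(m+n)} \prod_{i<j} (l_i - l_j)$

for $0 \le l_p \le \cdots \le l_1$.


Theorem 13.2.2. If $A$ and $B$ are distributed independently according to
$W(\Sigma, m)$ and $W(\Sigma, n)$ respectively ($m > p$, $n > p$), the joint density of the roots
of $|A - lB| = 0$ is (50) where $C_2$ is defined by (47).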
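The density (50) is straightforward to evaluate on the log scale. The sketch below is an illustration only: it assumes the standard definition $\Gamma_p(a) = \pi^{p(p-1)/4} \prod_{i=1}^{p} \Gamma[a - \tfrac12(i-1)]$ for the multivariate gamma function, and the function names are hypothetical.

```python
import math

def log_multigamma(a, p):
    """log Gamma_p(a) = p(p-1)/4 * log(pi) + sum_i lgamma(a - (i-1)/2)."""
    return (p * (p - 1) / 4.0 * math.log(math.pi)
            + sum(math.lgamma(a - i / 2.0) for i in range(p)))

def log_c2(p, m, n):
    """Logarithm of the constant C_2 in (47)."""
    return (p * p / 2.0 * math.log(math.pi)
            + log_multigamma((m + n) / 2.0, p)
            - log_multigamma(n / 2.0, p)
            - log_multigamma(m / 2.0, p)
            - log_multigamma(p / 2.0, p))

def log_density_roots(l, m, n):
    """Log of the joint density (50) at roots l_1 > ... > l_p > 0."""
    p = len(l)
    out = log_c2(p, m, n)
    for li in l:
        out += (m - p - 1) / 2.0 * math.log(li)
        out -= (m + n) / 2.0 * math.log(li + 1.0)
    for i in range(p):
        for j in range(i + 1, p):
            out += math.log(l[i] - l[j])
    return out
```

For $p = 1$ the expression reduces to the density of a ratio of independent $\chi^2$ variables with $m$ and $n$ degrees of freedom, $l^{m/2-1}(1+l)^{-(m+n)/2} / B(\tfrac12 m, \tfrac12 n)$, which gives a simple check of the constant.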


The joint density of $Y$ can be found from (45) and the fact that the
Jacobian is $|Y|^{-2p}$. (See Theorem A.4.6 of the Appendix.)


13.2.4. The Distribution for A Singular 

The matrix $A$ above can be represented as $A = W_1W_1'$, where the columns of
$W_1$ ($p \times m$) are independently distributed, each according to $N(0, \Sigma)$. We
now treat the case of $m < p$. If we let $B + W_1W_1' = G = CC'$ and $W_1 = CU$,
then the roots of

(51)    $0 = |A - f(A + B)| = |W_1W_1' - fG|$
$\qquad\;\;\, = |CUU'C' - fCC'| = |C| \cdot |UU' - fI_p| \cdot |C'|$

are the roots of

(52)    $|UU' - fI_p| = 0$.

We shall show that the nonzero roots $f_1 > \cdots > f_m$ (these roots being distinct
with probability 1) are the roots of

(53)    $|U'U - fI_m| = 0$.

For each root $f \ne 0$ of (52) there is a vector $x$ satisfying

(54)    $(UU' - fI_p)x = 0$.

Multiplication by $U'$ on the left gives

(55)    $0 = U'(UU' - fI_p)x = (U'U - fI_m)(U'x)$.

Thus $U'x$ is a characteristic vector of $U'U$ and $f$ is the corresponding root.
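The agreement between the nonzero roots of the $p \times p$ matrix $UU'$ and the roots of the $m \times m$ matrix $U'U$ is easy to confirm numerically. In the sketch below the dimensions and the random matrix `U` are invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
p, m = 5, 2                                  # the case m < p
U = 0.3 * rng.standard_normal((p, m))

# Characteristic roots in decreasing order.
roots_big = np.sort(np.linalg.eigvalsh(U @ U.T))[::-1]    # p roots of UU'
roots_small = np.sort(np.linalg.eigvalsh(U.T @ U))[::-1]  # m roots of U'U

# A characteristic vector x of UU' for a nonzero root maps to U'x,
# which is a characteristic vector of U'U for the same root.
vals, vecs = np.linalg.eigh(U @ U.T)
f_max, x = vals[-1], vecs[:, -1]
Utx = U.T @ x
```

The first $m$ entries of `roots_big` match `roots_small`, and the remaining $p - m$ roots of $UU'$ are zero up to rounding.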

It was shown in Section 8.4 that the density of U - Ul is (for I p — UU' 
positive definite ot I m - U *Ul positive definite) 

(56) I pt -U*Ul\ ¥n *- p ^'\ 

where $p^* = m$, $n^* - p^* - 1 = n - p - 1$, and $m^* = p$. Thus $f_1, \ldots, f_m$ must be
distributed according to (46) with $p$ replaced by $m$, $m$ by $p$, and $n$ by
$n + m - p$, that is,

(57)    $\dfrac{\pi^{\frac12 m^2}\, \Gamma_m[\frac12(m+n)]}{\Gamma_m[\frac12(m+n-p)]\, \Gamma_m(\frac12 m)\, \Gamma_m(\frac12 p)} \prod_{i=1}^{m} \left[ f_i^{\frac12(p-m-1)} (1 - f_i)^{\frac12(n-p-1)} \right] \prod_{i<j} (f_i - f_j)$.

Theorem 13.2.3. If $A$ is distributed as $W_1W_1'$, where the $m$ columns of $W_1$
are independent, each distributed according to $N(0, \Sigma)$, $m < p$, and $B$ is independently
distributed according to $W(\Sigma, n)$, $n > p$, then the density of the nonzero
roots of $|A - f(A + B)| = 0$ is given by (57).

These distributions of roots were found independently and at about the
same time by Fisher (1939), Girshick (1939), Hsu (1939a), Mood (1951), and
Roy (1939). The development of the Jacobian in Section 13.2.2 is due mainly
to Hsu [as reported by Deemer and Olkin (1951)].
13.3. THE CASE OF ONE NONSINGULAR WISHART MATRIX


In this section we shall find the distribution of the roots of

(1)    $|A - lI| = 0$,

where the matrix $A$ has the distribution $W(I, n)$. It will be observed that the
variances of the principal components of a sample of $n + 1$ from $N(\mu, I)$ are
$1/n$ times the roots of (1). We shall find the following theorem useful:


Theorem 13.3.1. If the symmetric matrix $B$ has a density of the form
$g(l_1, \ldots, l_p)$, where $l_1 > \cdots > l_p$ are the characteristic roots of $B$, then the density
of the roots is

(2)    $\dfrac{\pi^{\frac12 p^2}\, g(l_1, \ldots, l_p) \prod_{i<j} (l_i - l_j)}{\Gamma_p(\frac12 p)}$.


Proof. From Theorem A.2.1 of the Appendix we know that there exists an orthogonal matrix C such that

(3) $B = C'LC,$

where

(4) $L = \begin{pmatrix} l_1 & 0 & \cdots & 0\\ 0 & l_2 & \cdots & 0\\ \vdots & \vdots & & \vdots\\ 0 & 0 & \cdots & l_p \end{pmatrix}.$

If the $l$'s are numbered in descending order of magnitude and if $c_{i1} \ge 0$, then (with probability 1) the transformation from B to L and C is unique. Let the matrix C be given the coordinates $c_1, \ldots, c_{p(p-1)/2}$, and let the Jacobian of the transformation be $f(L, C)$. Then the joint density of L and C is $g(l_1, \ldots, l_p)f(L, C)$. To prove the theorem we must show that

(5) $\int \cdots \int f(L, C)\, dc_1 \cdots dc_{p(p-1)/2} = \dfrac{\pi^{\frac12 p^2} \prod_{i<j}(l_i - l_j)}{\Gamma_p(\frac12 p)}.$

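We show this by taking a special case where $B = UU'$ and $U$ ($p \times m$, $m \ge p$) has the density

(6) $\pi^{-\frac12 pm}\, \dfrac{\Gamma_p[\frac12(m + n)]}{\Gamma_p(\frac12 n)}\, |I - UU'|^{\frac12(n - p - 1)}.$

Then by Lemma 13.3.1, which will be stated below, B has the density

(7) $\dfrac{\Gamma_p[\frac12(m + n)]}{\Gamma_p(\frac12 n)\Gamma_p(\frac12 m)}\, |I - B|^{\frac12(n - p - 1)} |B|^{\frac12(m - p - 1)} = \dfrac{\Gamma_p[\frac12(m + n)]}{\Gamma_p(\frac12 m)\Gamma_p(\frac12 n)} \prod_{i=1}^p (1 - l_i)^{\frac12(n - p - 1)} \prod_{i=1}^p l_i^{\frac12(m - p - 1)} = g^*(l_1, \ldots, l_p).$

The joint density of L and C is $f(L, C)g^*(l_1, \ldots, l_p)$. In the preceding section we proved that the marginal density of L is (50). Thus

(8) $\int \cdots \int g^*(l_1, \ldots, l_p) f(L, C)\, dC = g^*(l_1, \ldots, l_p) \int \cdots \int f(L, C)\, dC = \dfrac{\pi^{\frac12 p^2}}{\Gamma_p(\frac12 p)}\, g^*(l_1, \ldots, l_p) \prod_{i<j}(l_i - l_j).$

This proves (5) and hence the theorem. ■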

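The statement above (7) is based on the following lemma:

Lemma 13.3.1. If the density of $Y$ ($p \times m$) is $f(YY')$, then the density of $B = YY'$ is

(9) $\dfrac{\pi^{\frac12 pm}}{\Gamma_p(\frac12 m)}\, |B|^{\frac12(m - p - 1)} f(B).$

The proof of this, like that of Theorem 13.3.1, depends on exhibiting a special case; let $f(YY') = (2\pi)^{-\frac12 pm} e^{-\frac12 \operatorname{tr} YY'}$; then (9) is $w(B \mid I, m)$.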

Now let us find the density of the roots of (1). The density of A is

(10) $\dfrac{|A|^{\frac12(n - p - 1)} e^{-\frac12 \operatorname{tr} A}}{2^{\frac12 pn}\, \Gamma_p(\frac12 n)} = \dfrac{\prod_{i=1}^p l_i^{\frac12(n - p - 1)} \exp\left(-\frac12 \sum_{i=1}^p l_i\right)}{2^{\frac12 pn}\, \Gamma_p(\frac12 n)}.$

Thus by the theorem we obtain as the density of the roots of A

(11) $\dfrac{\pi^{\frac12 p^2} \prod_{i=1}^p l_i^{\frac12(n - p - 1)} \exp\left(-\frac12 \sum_{i=1}^p l_i\right) \prod_{i<j}(l_i - l_j)}{2^{\frac12 pn}\, \Gamma_p(\frac12 n)\, \Gamma_p(\frac12 p)}.$

Theorem 13.3.2. If A ($p \times p$) has the distribution $W(I, n)$, then the characteristic roots ($l_1 \ge l_2 \ge \cdots \ge l_p \ge 0$) have the density (11) over the range where the density is not 0.

Corollary 13.3.1. Let $v_1 \ge \cdots \ge v_p$ be the sample variances of the sample principal components of a sample of size $N = n + 1$ from $N(\mu, \sigma^2 I)$. Then $(n/\sigma^2)v_i$ are distributed with density (11).
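
As a consistency check on the density of the roots of a $W(I, n)$ matrix, the case $p = 1$ can be worked out directly. The display below is a sketch under the usual conventions $\Gamma_1(a) = \Gamma(a)$ and $\Gamma(\frac12) = \sqrt{\pi}$; for $p = 1$ the product over $i < j$ is empty. The result is the $\chi^2$-density with $n$ degrees of freedom, as it should be for a one-dimensional Wishart variable.

```latex
% p = 1 in the density of the roots of A ~ W(I, n)
\frac{\pi^{1/2}\, l^{\frac12(n-2)}\, e^{-\frac12 l}}
     {2^{\frac12 n}\,\Gamma(\tfrac12 n)\,\Gamma(\tfrac12)}
 = \frac{l^{\frac12 n - 1}\, e^{-\frac12 l}}{2^{\frac12 n}\,\Gamma(\tfrac12 n)},
 \qquad l > 0 .
```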

The characteristic vectors of A are uniquely defined (except for multiplication by $-1$) with probability 1 by

(12) $(A - lI)y = 0, \qquad y'y = 1,$

since the roots are different with probability 1. Let the vectors with $y_{1i} \ge 0$ be

(13) $Y = (y_1, \ldots, y_p).$

Then

(14) $AY = YL.$

From Section 11.2 we know that

(15) $Y'Y = I.$

Multiplication of (14) on the right by $Y^{-1} = Y'$ gives

(16) $A = YLY'.$

Thus $Y' = C$, defined above.

Now let us consider the joint distribution of L and C. The matrix A has the distribution of

(17) $A = \sum_{\alpha=1}^n X_\alpha X_\alpha',$

where the $X_\alpha$ are independently distributed, each according to $N(0, I)$. Let

(18) $X_\alpha^* = QX_\alpha,$

where Q is any orthogonal matrix. Then the $X_\alpha^*$ are independently distributed according to $N(0, I)$ and

(19) $A^* = \sum_{\alpha=1}^n X_\alpha^* X_\alpha^{*\prime} = QAQ'$

is distributed according to $W(I, n)$. The roots of $A^*$ are the roots of A; thus

(20) $A^* = C^{**\prime}LC^{**},$

(21) $C^{**\prime}C^{**} = I$

define $C^{**}$ if we require $c_{i1}^{**} \ge 0$. Let

(22) $C^* = CQ'.$

Let

(23) $J(C^*) = \begin{pmatrix} c_{11}^*/|c_{11}^*| & 0 & \cdots & 0\\ 0 & c_{21}^*/|c_{21}^*| & \cdots & 0\\ \vdots & \vdots & & \vdots\\ 0 & 0 & \cdots & c_{p1}^*/|c_{p1}^*| \end{pmatrix},$

with $c_{i1}^*/|c_{i1}^*| = 1$ if $c_{i1}^* = 0$. Thus $J(C^*)$ is a diagonal matrix; the ith diagonal element is 1 if $c_{i1}^* \ge 0$ and is $-1$ if $c_{i1}^* < 0$. Thus

(24) $C^{**} = J(C^*)C^* = J(CQ')CQ'.$

The distribution of $C^{**}$ is the same as that of C. We now shall show that this fact defines the distribution of C.


Definition 13.3.1. If the random orthogonal matrix E of order p has a distribution such that $EQ'$ has the same distribution for every orthogonal Q, then E is said to have the Haar invariant distribution (or normalized measure).

The definition is possible because it has been proved that there is only one distribution with the required invariance property [Halmos (1950)]. It has also been shown that this distribution is the only one invariant under multiplication on the left by an orthogonal matrix (i.e., the distribution of $QE$ is the same as that of E). From this it follows that the probability is $1/2^p$ that E is such that $e_{i1} \ge 0$. This can be seen as follows. Let $J_1, \ldots, J_{2^p}$ be the $2^p$ diagonal matrices with elements $+1$ and $-1$. Since the distribution of $J_iE$ is the same as that of E, the probability that $e_{i1} \ge 0$ is the same as the probability that the elements in the first column of $J_iE$ are nonnegative. These events for $i = 1, \ldots, 2^p$ are mutually exclusive and exhaustive (except for elements being 0, which have probability 0), and thus the probability of any one is $1/2^p$.

The conditional distribution of E given $e_{i1} \ge 0$ is $2^p$ times the Haar invariant distribution over this part of the space. We shall call it the conditional Haar invariant distribution.
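
The counting step behind the probability $1/2^p$ can be made concrete: for a fixed orthogonal matrix whose first column has no zero element, exactly one of the $2^p$ sign matrices $J_i$ makes the first column of $J_iE$ nonnegative. The sketch below enumerates the sign patterns for one hypothetical $3 \times 3$ orthogonal matrix; the matrix and the function name are illustrative assumptions, not part of the text.

```python
from itertools import product
from fractions import Fraction

# Hypothetical 3 x 3 orthogonal matrix E (rows are orthonormal); illustration only.
E = [[Fraction(1, 3), Fraction(2, 3), Fraction(2, 3)],
     [Fraction(2, 3), Fraction(1, 3), Fraction(-2, 3)],
     [Fraction(-2, 3), Fraction(2, 3), Fraction(-1, 3)]]

def first_column_nonnegative(signs, e):
    """True if the first column of J E is nonnegative, where J = diag(signs)."""
    return all(s * row[0] >= 0 for s, row in zip(signs, e))

p = len(E)
all_signs = list(product((1, -1), repeat=p))   # diagonals of J_1, ..., J_{2^p}
matches = [s for s in all_signs if first_column_nonnegative(s, E)]
print(len(all_signs), matches)
```

Because the $2^p$ events are equally likely under the invariant distribution and exactly one of them occurs (apart from zero elements), each has probability $1/2^p$.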

Lemma 13.3.2. If the orthogonal matrix E has a distribution such that $e_{i1} \ge 0$ and if $E^{**} = J(EQ')EQ'$ has the same distribution for every orthogonal Q, then E has the conditional Haar invariant distribution.

Proof. Let the space V of orthogonal matrices be partitioned into the subspaces $V_1, \ldots, V_{2^p}$ so that $J_iV_i = V_1$, say, where $J_1 = I$ and $V_1$ is the set for which $e_{i1} \ge 0$. Let $\mu_1$ be the measure in $V_1$ defined by the distribution of E assumed in the lemma. The measure $\mu(W)$ of a (measurable) set W in $V_i$ is defined as $(1/2^p)\mu_1(J_iW)$. Now we want to show that $\mu$ is the Haar invariant measure. Let W be any (measurable) set in $V_1$. The lemma assumes
that 2= Pr{£ g = Pr{£** Y.^U! i WQ t 

If U is any (measurable) set in V, then $U = \bigcup_{j=1}^{2^p}(U \cap V_j)$. Since $\mu(U \cap V_j) = (1/2^p)\mu_1[J_j(U \cap V_j)]$, by the above this is $\mu[(U \cap V_j)Q']$. Thus $\mu(U) = \mu(UQ')$. Thus $\mu$ is invariant and $\mu_1$ is the conditional invariant distribution. ■

From the lemma we see that the matrix C has the conditional Haar invariant distribution. Since the distribution of C conditional on L is the same, C and L are independent.

Theorem 13.3.3. If $C = Y'$, where $Y = (y_1, \ldots, y_p)$ are the normalized characteristic vectors of A with $y_{1i} \ge 0$ and where A is distributed according to $W(I, n)$, then C has the conditional Haar invariant distribution and C is distributed independently of the characteristic roots.

From the preceding work we can generalize Theorem 13.3.1.

Theorem 13.3.4. If the symmetric matrix B has a density of the form $g(l_1, \ldots, l_p)$, where $l_1 > \cdots > l_p$ are the characteristic roots of B, then the joint density of the roots is (2) and the matrix of normalized characteristic vectors Y ($y_{1i} \ge 0$) is independently distributed according to the conditional Haar invariant distribution.

Proof. The density of $QBQ'$, where $QQ' = I$, is the same as that of B (for the roots are invariant), and therefore the distribution of $J(Y'Q')Y'Q'$ is the same as that of $Y'$. Then Theorem 13.3.4 follows from Lemma 13.3.2. ■

We shall give an application of this theorem to the case where $B = B'$ is normally distributed with the (functionally independent) components of B independent with means 0 and variances $\mathcal{E}b_{ii}^2 = 1$ and $\mathcal{E}b_{ij}^2 = \frac12$ ($i < j$).
Theorem 13.3.5. Let $B = B'$ have the density

(25) $\pi^{-p(p+1)/4}\, 2^{-\frac12 p}\, e^{-\frac12 \operatorname{tr} B^2}.$

Then the characteristic roots $l_1 > \cdots > l_p$ of B have the density

(26) $\pi^{\frac14 p(p-1)}\, 2^{-\frac12 p}\, \Gamma_p^{-1}(\tfrac12 p) \exp\left(-\tfrac12 \sum_{i=1}^p l_i^2\right) \prod_{i<j}(l_i - l_j)$

and the matrix Y of the normalized characteristic vectors ($y_{1i} \ge 0$) is independently distributed according to the conditional Haar invariant distribution.

Proof. Since the characteristic roots of $B^2$ are $l_1^2, \ldots, l_p^2$ and $\operatorname{tr} B^2 = \sum_{i=1}^p l_i^2$, the theorem follows directly. ■

Corollary 13.3.2. Let $nS$ be distributed according to $W(I, n)$, and define the diagonal matrix L and C by $S = C'LC$, $C'C = I$, $l_1 > \cdots > l_p$, and $c_{i1} \ge 0$, $i = 1, \ldots, p$. Then the density of the limiting distribution of $\sqrt{n}(L - I) = D$ diagonal is (26) with $l_i$ replaced by $d_i$, and the matrix C is independently distributed according to the conditional Haar measure.

Proof. The density of the limiting distribution of $\sqrt{n}(S - I)$ is (25), and the diagonal elements of D are the characteristic roots of $\sqrt{n}(S - I)$ and the columns of $C'$ are the characteristic vectors. ■


13.4. CANONICAL CORRELATIONS 

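The sample canonical correlations were shown in Section 12.3 to be the square roots of the roots of

(1) $|A_{12}A_{22}^{-1}A_{21} - fA_{11}| = 0,$

where

(2) $A_{ij} = \sum_{\alpha=1}^N (x_\alpha^{(i)} - \bar{x}^{(i)})(x_\alpha^{(j)} - \bar{x}^{(j)})', \qquad i, j = 1, 2,$

and the distribution of

(3) $X = \begin{pmatrix} X^{(1)}\\ X^{(2)} \end{pmatrix}$

is $N(\mu, \Sigma)$, where

(4) $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$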

From Section 3.3 we know that the distribution of $A_{ij}$ is the same as that of

(5) $A_{ij} = \sum_{\alpha=1}^n Y_\alpha^{(i)} Y_\alpha^{(j)\prime}, \qquad i, j = 1, 2,$

where $n = N - 1$ and

(6) $Y = \begin{pmatrix} Y^{(1)}\\ Y^{(2)} \end{pmatrix}$

is distributed according to $N(0, \Sigma)$. Let us assume that the dimensionality of $Y^{(1)}$, say $p_1$, is not greater than the dimensionality of $Y^{(2)}$, say $p_2$. Then there are $p_1$ nonzero roots of (1), say

(7) $f_1 > f_2 > \cdots > f_{p_1}.$

Now we shall find the distribution of $\{f_i\}$ when

(8) $\Sigma_{12} = 0.$

For the moment assume $\{Y_\alpha^{(2)}\}$ to be fixed. Then $A_{22}$ is fixed, and

(9) $B = A_{12}A_{22}^{-1}$

is the matrix of regression coefficients of $Y^{(1)}$ on $Y^{(2)}$. From Section 4.3 we know that

(10) $A_{11\cdot 2} = \sum_{\alpha=1}^n (Y_\alpha^{(1)} - BY_\alpha^{(2)})(Y_\alpha^{(1)} - BY_\alpha^{(2)})' = A_{11} - BA_{22}B'$

and

(11) $Q = BA_{22}B' = A_{12}A_{22}^{-1}A_{21}$

($\beta = 0$) are independently distributed according to $W(\Sigma_{11}, n - p_2)$ and $W(\Sigma_{11}, p_2)$, respectively. In terms of Q the equation (1) defining f is

(12) $|Q - f(A_{11\cdot 2} + Q)| = 0.$

The distribution of $f_i$, $i = 1, \ldots, p_1$, is the distribution of the nonzero roots of (12), and the density is given by (see Section 13.2)


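(13) $\dfrac{\pi^{\frac12 p_1^2}\, \Gamma_{p_1}(\frac12 n)}{\Gamma_{p_1}[\frac12(n - p_2)]\, \Gamma_{p_1}(\frac12 p_1)\, \Gamma_{p_1}(\frac12 p_2)} \prod_{i=1}^{p_1} \left\{ f_i^{\frac12(p_2 - p_1 - 1)} (1 - f_i)^{\frac12(N - p_2 - p_1 - 2)} \right\} \prod_{i<j}(f_i - f_j).$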


Since the conditional density (13) does not depend upon $Y^{(2)}$, (13) is the unconditional density of the squares of the sample canonical correlation coefficients of the two sets $X_\alpha^{(1)}$ and $X_\alpha^{(2)}$, $\alpha = 1, \ldots, N$. The density (13) also holds when the $X^{(2)}$ are actually fixed variate vectors or have any distribution, so long as $X^{(1)}$ and $X^{(2)}$ are independently distributed and $X^{(1)}$ has a multivariate normal distribution.

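In the special case when $p_1 = 1$, $p_2 = p - 1$, (13) reduces to

(14) $\dfrac{\Gamma[\frac12(N - 1)]}{\Gamma[\frac12(N - p)]\, \Gamma[\frac12(p - 1)]}\, f^{\frac12(p - 3)} (1 - f)^{\frac12(N - p - 2)},$

which is the density of the square of the sample multiple correlation coefficient between $X^{(1)}$ ($p_1 = 1$) and $X^{(2)}$ ($p_2 = p - 1$).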
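
The density of the squared sample multiple correlation in this special case is a beta density with parameters $\frac12(p - 1)$ and $\frac12(N - p)$, so it should integrate to 1 and have mean $(p - 1)/(N - 1)$. The sketch below checks both numerically with a plain midpoint rule; the values $N = 20$, $p = 5$ and the function names are hypothetical choices made only for the illustration.

```python
import math

def multiple_correlation_sq_density(f, N, p):
    """Density of the squared sample multiple correlation when Sigma_12 = 0."""
    const = math.gamma((N - 1) / 2) / (math.gamma((N - p) / 2) * math.gamma((p - 1) / 2))
    return const * f ** ((p - 3) / 2) * (1 - f) ** ((N - p - 2) / 2)

def midpoint_integral(g, a, b, steps=20000):
    """Plain midpoint rule; adequate for the smooth case used below."""
    h = (b - a) / steps
    return h * sum(g(a + (k + 0.5) * h) for k in range(steps))

# Hypothetical sample size and dimension, chosen so that both exponents are positive.
N, p = 20, 5
total = midpoint_integral(lambda f: multiple_correlation_sq_density(f, N, p), 0.0, 1.0)
mean = midpoint_integral(lambda f: f * multiple_correlation_sq_density(f, N, p), 0.0, 1.0)
print(total, mean)   # total should be close to 1, mean close to (p - 1)/(N - 1)
```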


13.5. ASYMPTOTIC DISTRIBUTIONS IN THE CASE OF 
ONE WISHART MATRIX 

13.5.1. All Population Roots Different 

In Section 13.3 we found the density of the diagonal matrix L and the orthogonal matrix B defined by $S = BLB'$, $l_1 \ge \cdots \ge l_p$, and $b_{1i} \ge 0$, $i = 1, \ldots, p$, when $nS$ is distributed according to $W(I, n)$. In this section we find the asymptotic distribution of L and B when $nS$ is distributed according to $W(\Sigma, n)$ and the characteristic roots of $\Sigma$ are different. (Corollary 13.3.2 gave the asymptotic distribution when $\Sigma = I$.)

Theorem 13.5.1. Suppose $nS$ has the distribution $W(\Sigma, n)$. Define diagonal $\Lambda$ and L and orthogonal $\beta$ and B by

(1) $\Sigma = \beta\Lambda\beta', \qquad S = BLB',$

$\lambda_1 > \lambda_2 > \cdots > \lambda_p$, $l_1 \ge l_2 \ge \cdots \ge l_p$, $\beta_{1i} \ge 0$, $b_{1i} \ge 0$, $i = 1, \ldots, p$. Define $G = \sqrt{n}(B - \beta)$ and diagonal $D = \sqrt{n}(L - \Lambda)$. Then the limiting distribution of D and G is normal with D and G independent, and the diagonal elements of D are independent. The diagonal element $d_i$ has the limiting distribution $N(0, 2\lambda_i^2)$. The covariance matrix of $g_i$ in the limiting distribution of $G = (g_1, \ldots, g_p)$ is

(2) $\operatorname{Cov}(g_i) = \sum_{k \ne i} \dfrac{\lambda_i\lambda_k}{(\lambda_i - \lambda_k)^2}\, \beta_k\beta_k',$

where $\beta = (\beta_1, \ldots, \beta_p)$. The covariance matrix of $g_i$ and $g_j$ in the limiting distribution is

(3) $\operatorname{Cov}(g_i, g_j') = -\dfrac{\lambda_i\lambda_j}{(\lambda_i - \lambda_j)^2}\, \beta_j\beta_i', \qquad i \ne j.$

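Proof. The matrix $nT = n\beta'S\beta$ is distributed according to $W(\Lambda, n)$. Let

(4) $T = YLY',$

where Y is orthogonal. In order that (4) determine Y uniquely, we require $y_{ii} > 0$. Let $\sqrt{n}(T - \Lambda) = U$ and $\sqrt{n}(Y - I) = W$. Then (4) can be written

(5) $\Lambda + \dfrac{1}{\sqrt{n}}U = \left(I + \dfrac{1}{\sqrt{n}}W\right)\left(\Lambda + \dfrac{1}{\sqrt{n}}D\right)\left(I + \dfrac{1}{\sqrt{n}}W\right)',$

which is equivalent to

(6) $U = W\Lambda + D + \Lambda W' + \dfrac{1}{\sqrt{n}}(WD + W\Lambda W' + DW') + \dfrac{1}{n}WDW'.$

From $I = YY' = [I + (1/\sqrt{n})W][I + (1/\sqrt{n})W']$, we have

(7) $0 = W + W' + \dfrac{1}{\sqrt{n}}WW'.$

We shall proceed heuristically and justify the method later. If we neglect terms of order $1/\sqrt{n}$ and $1/n$ in (6) and (7), we obtain

(8) $U = W\Lambda + D + \Lambda W',$

(9) $0 = W + W'.$

When we substitute $W' = -W$ from (9) into (8) and write the result in components, we obtain $w_{ii} = 0$,

(10) $d_i = u_{ii}, \qquad i = 1, \ldots, p,$

(11) $w_{ij} = \dfrac{u_{ij}}{\lambda_j - \lambda_i}, \qquad i \ne j, \quad i, j = 1, \ldots, p.$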
(Note $w_{ij} = -w_{ji}$.) From Theorem 3.4.4 we know that in the limiting normal distribution of U the functionally independent elements are statistically independent with means 0 and variances $\mathcal{E}u_{ii}^2 = 2\lambda_i^2$ and $\mathcal{E}u_{ij}^2 = \lambda_i\lambda_j$, $i \ne j$. Then the limiting distribution of D and W is normal, and $d_1, \ldots, d_p, w_{12}, w_{13}, \ldots, w_{p-1,p}$ are independent with means 0 and variances $\operatorname{Var}(d_i) = 2\lambda_i^2$, $i = 1, \ldots, p$, and $\operatorname{Var}(w_{ij}) = \lambda_i\lambda_j/(\lambda_j - \lambda_i)^2$, $j = i + 1, \ldots, p$, $i = 1, \ldots, p - 1$. Each column of B is $\pm$ the corresponding column of $\beta Y$; since $Y \xrightarrow{p} I$, we have $\beta Y \xrightarrow{p} \beta$, and with arbitrarily high probability each column of B is nearly identical to the corresponding column of $\beta Y$. Then $G = \sqrt{n}(B - \beta)$ has the limiting distribution of $\beta\sqrt{n}(Y - I) = \beta W$. The asymptotic variances and covariances follow.

Now we justify the limiting distribution of D and W. The equations $T = YLY'$ and $I = YY'$ and conditions $l_1 > \cdots > l_p$, $y_{ii} > 0$, $i = 1, \ldots, p$, define a 1-1 transformation of T to Y, L except for a set of measure 0. The transformation from Y, L to T is continuously differentiable. The inverse is continuously differentiable in a neighborhood of $Y = I$ and $L = \Lambda$, since the equations (8) and (9) can be solved uniquely. Hence Y, L as a function of T satisfies the conditions of Theorem 4.2.3. ■
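
The first-order relations $d_i \approx u_{ii}$ and $w_{ij} \approx u_{ij}/(\lambda_j - \lambda_i)$ used in the proof can be checked numerically in the $2 \times 2$ case, where the roots and vectors of a symmetric matrix are available in closed form. The sketch below is an illustration under assumed inputs: $\Lambda = \operatorname{diag}(4, 1)$, a fixed symmetric U, and $n = 10^6$ are hypothetical values, and the helper name is not from the text.

```python
import math

def eigen_2x2_symmetric(a, b, c):
    """Roots l1 >= l2 and the first characteristic vector of [[a, b], [b, c]]."""
    mean, half_diff = (a + c) / 2, (a - c) / 2
    radius = math.hypot(half_diff, b)
    theta = 0.5 * math.atan2(2 * b, a - c)   # rotation angle of the leading vector
    return mean + radius, mean - radius, (math.cos(theta), math.sin(theta))

# Hypothetical inputs: Lambda = diag(4, 1), a fixed symmetric U, and a large n.
lam1, lam2 = 4.0, 1.0
u11, u12, u22 = 0.5, 0.3, -0.2
n = 10 ** 6
root_n = math.sqrt(n)

# T = Lambda + U / sqrt(n)
l1, l2, y1 = eigen_2x2_symmetric(lam1 + u11 / root_n, u12 / root_n, lam2 + u22 / root_n)

d1, d2 = root_n * (l1 - lam1), root_n * (l2 - lam2)   # D = sqrt(n)(L - Lambda)
w21 = root_n * y1[1]                                   # element (2, 1) of W = sqrt(n)(Y - I)

print(d1 - u11, d2 - u22)           # d_i is close to u_ii
print(w21 - u12 / (lam1 - lam2))    # w_21 is close to u_21 / (lambda_1 - lambda_2)
```

The discrepancies printed are of order $1/\sqrt{n}$, which is the size of the terms neglected in passing from (6) and (7) to (8) and (9).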


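13.5.2. One Root of Higher Multiplicity

In Section 11.7.3 we used the asymptotic distribution of the q smallest sample roots when the q smallest population roots are equal. We shall now derive that distribution. Let

(12) $\Lambda = \begin{pmatrix} \Lambda_1 & 0\\ 0 & \lambda^*I_q \end{pmatrix},$

where the diagonal elements of the diagonal matrix $\Lambda_1$ are different and are larger than $\lambda^*$ ($> 0$). Let

(13) $T = \begin{pmatrix} T_{11} & T_{12}\\ T_{21} & T_{22} \end{pmatrix}, \qquad Y = \begin{pmatrix} Y_{11} & Y_{12}\\ Y_{21} & Y_{22} \end{pmatrix}, \qquad L = \begin{pmatrix} L_1 & 0\\ 0 & L_2 \end{pmatrix}.$

Then $T \xrightarrow{p} \Lambda$, which implies $L \xrightarrow{p} \Lambda$, $Y_{11} \xrightarrow{p} I$, $Y_{12} \xrightarrow{p} 0$, $Y_{21} \xrightarrow{p} 0$, but $Y_{22}$ does not have a probability limit. However, $Y_{22}Y_{22}' \xrightarrow{p} I_q$. Let the singular value decomposition of $Y_{22}$ be $EJF$, where J is diagonal and E and F are orthogonal. Define $C_2 = EF$, which is orthogonal. Let $U = \sqrt{n}(T - \Lambda)$ and $D = \sqrt{n}(L - \Lambda)$ be partitioned similarly to T and L. Define $W_{11} = \sqrt{n}(Y_{11} - I)$, $W_{12} = \sqrt{n}Y_{12}$, $W_{21} = \sqrt{n}Y_{21}$, and $W_{22} = \sqrt{n}(Y_{22} - C_2) = \sqrt{n}E(J - I_q)F$.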
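Then (4) can be written

(14) $\begin{pmatrix} \Lambda_1 & 0\\ 0 & \lambda^*I_q \end{pmatrix} + \dfrac{1}{\sqrt{n}} \begin{pmatrix} U_{11} & U_{12}\\ U_{21} & U_{22} \end{pmatrix}$

$= \left[ \begin{pmatrix} I_{p-q} & 0\\ 0 & C_2 \end{pmatrix} + \dfrac{1}{\sqrt{n}} \begin{pmatrix} W_{11} & W_{12}\\ W_{21} & W_{22} \end{pmatrix} \right] \left[ \begin{pmatrix} \Lambda_1 & 0\\ 0 & \lambda^*I_q \end{pmatrix} + \dfrac{1}{\sqrt{n}} \begin{pmatrix} D_1 & 0\\ 0 & D_2 \end{pmatrix} \right] \left[ \begin{pmatrix} I_{p-q} & 0\\ 0 & C_2' \end{pmatrix} + \dfrac{1}{\sqrt{n}} \begin{pmatrix} W_{11}' & W_{21}'\\ W_{12}' & W_{22}' \end{pmatrix} \right]$

$= \begin{pmatrix} \Lambda_1 & 0\\ 0 & \lambda^*I_q \end{pmatrix} + \dfrac{1}{\sqrt{n}} \left[ \begin{pmatrix} D_1 & 0\\ 0 & C_2D_2C_2' \end{pmatrix} + \begin{pmatrix} W_{11}\Lambda_1 & \lambda^*W_{12}C_2'\\ W_{21}\Lambda_1 & \lambda^*W_{22}C_2' \end{pmatrix} + \begin{pmatrix} \Lambda_1W_{11}' & \Lambda_1W_{21}'\\ \lambda^*C_2W_{12}' & \lambda^*C_2W_{22}' \end{pmatrix} \right] + \dfrac{1}{n}M,$

where the submatrices of M are sums of products of $C_2$, $\Lambda_1$, $\lambda^*I_q$, $D_k$, $W_{kl}$, and $1/\sqrt{n}$. The orthogonality of Y ($I_p = YY'$) implies

(15) $\begin{pmatrix} I_{p-q} & 0\\ 0 & I_q \end{pmatrix} = \begin{pmatrix} I_{p-q} & 0\\ 0 & I_q \end{pmatrix} + \dfrac{1}{\sqrt{n}} \left[ \begin{pmatrix} W_{11} & W_{12}C_2'\\ W_{21} & W_{22}C_2' \end{pmatrix} + \begin{pmatrix} W_{11}' & W_{21}'\\ C_2W_{12}' & C_2W_{22}' \end{pmatrix} \right] + \dfrac{1}{n}N,$

where the submatrices of N are sums of products of $W_{kl}$. From (14) and (15) we find that

(16) $U_{22} = C_2D_2C_2' + O_p(1/\sqrt{n}).$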


The limiting distribution of $(1/\lambda^*)U_{22}$ has the density (25) of Section 13.3 with p replaced by q. Then the limiting distribution of $D_2$ and $C_2$ is the distribution of $D_2^*$ and $Y_{22}^*$ defined by $U_{22}^* = Y_{22}^*D_2^*Y_{22}^{*\prime}$, where $(1/\lambda^*)U_{22}^*$ has the density (25) of Section 13.3.
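Theorem 13.5.2. Under the conditions of Theorem 13.5.1 and $\Lambda = \operatorname{diag}(\Lambda_1, \lambda^*I_q)$, the density of the limiting distribution of $d_{p-q+1}, \ldots, d_p$ is

(17) $2^{-\frac12 q}\, (\lambda^*)^{-\frac12 q(q+1)}\, \pi^{\frac14 q(q-1)}\, \Gamma_q^{-1}(\tfrac12 q) \exp\left( -\dfrac{1}{2\lambda^{*2}} \sum_{i=p-q+1}^p d_i^2 \right) \prod_{i<j}(d_i - d_j).$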

To justify the preceding derivation we note that $D_2$ and $Y_{22}$ are functions of U depending on n that converge to the solution of $U_{22}^* = Y_{22}^*D_2^*Y_{22}^{*\prime}$. We can use the following theorem given by Anderson (1963a) and due to Rubin.

Theorem 13.5.3. Let $F_n(u)$ be the cumulative distribution function of a random matrix $U_n$. Let $V_n$ be a matrix-valued function of $U_n$, $V_n = f_n(U_n)$, and let $G_n(v)$ be the (induced) distribution of $V_n$. Suppose $F_n(u) \to F(u)$ in every continuity point of $F(u)$, and suppose for every continuity point u of $f(u)$, $f_n(u_n) \to f(u)$ when $u_n \to u$. Let $G(v)$ be the distribution of the random matrix $V = f(U)$, where U has the distribution $F(u)$. If the probability of the set of discontinuities of $f(u)$ according to $F(u)$ is 0, then

(18) $\lim_{n \to \infty} G_n(v) = G(v)$

in every continuity point of $G(v)$.

The details of verifying that $U(n)$ and

(19) $(D_2(n), Y_{22}(n)) = f_n(U(n))$

satisfy the conditions of the theorem have been given by Anderson (1963a).


13.6. ASYMPTOTIC DISTRIBUTIONS IN THE CASE OF 
TWO WISHART MATRICES 


13.6.1. All Population Roots Different 

In Section 13.2 we studied the distributions of the roots $l_1 \ge l_2 \ge \cdots \ge l_p$ of

(1) $|S^* - lT^*| = 0$

and the vectors satisfying

(2) $(S^* - lT^*)x^* = 0$

and $x^{*\prime}T^*x^* = 1$ when $A^* = mS^*$ and $B^* = nT^*$ are distributed independently according to $W(\Sigma, m)$ and $W(\Sigma, n)$, respectively. In this section we study the asymptotic distributions of the roots and vectors as $n \to \infty$ when $A^*$ and $B^*$ are distributed independently according to $W(\Phi, m)$ and $W(\Sigma, n)$, respectively, and $m/n \to \eta > 0$. We shall assume that the roots of

(3) $|\Phi - \lambda\Sigma| = 0$

are distinct. (In Section 13.2, $\lambda_1 = \cdots = \lambda_p = 1$.)

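Theorem 13.6.1. Let $mS^*$ and $nT^*$ be independently distributed according to $W(\Phi, m)$ and $W(\Sigma, n)$, respectively. Let $\lambda_1 > \lambda_2 > \cdots > \lambda_p$ ($> 0$) be the roots of (3), and let $\Lambda$ be the diagonal matrix with the roots as diagonal elements in descending order; let $\gamma_1, \ldots, \gamma_p$ be the solutions to

(4) $(\Phi - \lambda_i\Sigma)\gamma = 0, \qquad i = 1, \ldots, p,$

$\gamma'\Sigma\gamma = 1$, and $\gamma_{1i} > 0$, and let $\Gamma = (\gamma_1, \ldots, \gamma_p)$. Let $l_1 \ge \cdots \ge l_p$ ($> 0$) be the roots of (1), and let L be the diagonal matrix with the roots as diagonal elements in descending order; let $x_1^*, \ldots, x_p^*$ be the solutions to (2) for $l = l_i$, $i = 1, \ldots, p$, $x^{*\prime}T^*x^* = 1$, and $x_{1i}^* > 0$, and let $X^* = (x_1^*, \ldots, x_p^*)$. Define $Z^* = \sqrt{n}(X^* - \Gamma)$ and diagonal $D = \sqrt{n}(L - \Lambda)$. Then the limiting distribution of D and $Z^*$ is normal with means 0 as $n \to \infty$, $m \to \infty$, and $m/n \to \eta$ ($> 0$). The asymptotic variances and covariances that are not 0 are

(5) $\operatorname{Var}(d_i) = \dfrac{2\lambda_i^2(1 + \eta)}{\eta},$

(6) $\operatorname{Cov}(z_i^*) = \sum_{k \ne i} \dfrac{\lambda_i(\lambda_k + \eta\lambda_i)}{\eta(\lambda_k - \lambda_i)^2}\, \gamma_k\gamma_k' + \tfrac12 \gamma_i\gamma_i',$

(7) $\operatorname{Cov}(d_i, z_i^*) = \lambda_i\gamma_i,$

(8) $\operatorname{Cov}(z_i^*, z_j^{*\prime}) = -\dfrac{\lambda_i\lambda_j(1 + \eta)}{\eta(\lambda_i - \lambda_j)^2}\, \gamma_j\gamma_i', \qquad i \ne j.$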

Proof. Let

(9) $S = \Gamma'S^*\Gamma, \qquad T = \Gamma'T^*\Gamma.$

Then $mS$ and $nT$ are distributed independently according to $W(\Lambda, m)$ and $W(I, n)$, respectively (Section 7.3.3). Then $l_1, \ldots, l_p$ are the roots of

(10) $|S - lT| = 0.$

Let $x_1, \ldots, x_p$ be the solutions to

(11) $(S - l_iT)x = 0, \qquad i = 1, \ldots, p,$

and $x'Tx = 1$, and let $X = (x_1, \ldots, x_p)$. Then $x_i^* = \Gamma x_i$ and $X^* = \Gamma X$ except possibly for multiplication of columns of X (or $X^*$) by $-1$. If $Z = \sqrt{n}(X - I)$, then $Z^* = \Gamma Z$ (except possibly for multiplication of columns by $-1$).

We shall now find the limiting distribution of D and Z. Let $\sqrt{n}(S - \Lambda) = U$ and $\sqrt{n}(T - I) = V$. Then U and V have independent limiting normal distributions with means 0. The functionally independent elements of U and V are statistically independent in the limiting distribution. The variances are $\mathcal{E}u_{ii}^2 = 2(n/m)\lambda_i^2 \to 2\lambda_i^2/\eta$; $\mathcal{E}u_{ij}^2 = (n/m)\lambda_i\lambda_j \to \lambda_i\lambda_j/\eta$, $i \ne j$; $\mathcal{E}v_{ii}^2 = 2$; $\mathcal{E}v_{ij}^2 = 1$, $i \ne j$.

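From the definition of L and X we have $SX = TXL$, $X'TX = I$, and $X'SX = L$. If we let $X^{-1} = G$, we obtain

(12) $S = G'LG, \qquad T = G'G.$

We require $g_{ii} > 0$, $i = 1, \ldots, p$. Since $S \xrightarrow{p} \Lambda$ and $T \xrightarrow{p} I$, we have $L \xrightarrow{p} \Lambda$ and $G \xrightarrow{p} I$. Let $\sqrt{n}(G - I) = H$. Then we write (12) as

(13) $\Lambda + \dfrac{1}{\sqrt{n}}U = \left(I + \dfrac{1}{\sqrt{n}}H'\right)\left(\Lambda + \dfrac{1}{\sqrt{n}}D\right)\left(I + \dfrac{1}{\sqrt{n}}H\right),$

(14) $I + \dfrac{1}{\sqrt{n}}V = \left(I + \dfrac{1}{\sqrt{n}}H'\right)\left(I + \dfrac{1}{\sqrt{n}}H\right).$

These can be rewritten

(15) $U = D + \Lambda H + H'\Lambda + \dfrac{1}{\sqrt{n}}(DH + H'D + H'\Lambda H) + \dfrac{1}{n}H'DH,$

(16) $V = H + H' + \dfrac{1}{\sqrt{n}}H'H.$

If we neglect the terms of order $1/\sqrt{n}$ and $1/n$ (as in Section 13.5), we can write

(17) $U = D + \Lambda H + H'\Lambda,$

(18) $V = H + H',$

(19) $U - V\Lambda = D + \Lambda H - H\Lambda.$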
The diagonal elements of (18) and the components of (19) are

(20) $2h_{ii} = v_{ii},$

(21) $u_{ii} - \lambda_iv_{ii} = d_i,$

(22) $(\lambda_i - \lambda_j)h_{ij} = u_{ij} - v_{ij}\lambda_j, \qquad i \ne j.$

The limiting distribution of H and D is normal with means 0. The pairs $(h_{ij}, h_{ji})$ of off-diagonal elements of H are independent with variances

(23) $\operatorname{Var}(h_{ij}) = \dfrac{\lambda_j(\lambda_i + \eta\lambda_j)}{\eta(\lambda_i - \lambda_j)^2}, \qquad i \ne j,$

and covariances

(24) $\operatorname{Cov}(h_{ij}, h_{ji}) = -\dfrac{\lambda_i\lambda_j(1 + \eta)}{\eta(\lambda_i - \lambda_j)^2}, \qquad i \ne j.$

The pairs $(d_i, h_{ii})$ of diagonal elements of D and H are independent with variances (5),

(25) $\operatorname{Var}(h_{ii}) = \tfrac12,$

and covariance

(26) $\operatorname{Cov}(d_i, h_{ii}) = -\lambda_i.$

The diagonal elements of D and H are independent of the off-diagonal elements of H.

That the limiting distribution of D and H is normal is justified by Theorem 4.2.3. S and T are polynomials in L and G, and their derivatives are polynomials and hence continuous. Since the equations (12) with auxiliary conditions can be solved uniquely for L and G, the inverse function is also continuously differentiable at $L = \Lambda$ and $G = I$. By Theorem 4.2.3, $D = \sqrt{n}(L - \Lambda)$ and $H = \sqrt{n}(G - I)$ have a limiting normal distribution. In turn, $X = G^{-1}$ is continuously differentiable at $G = I$, and $Z = \sqrt{n}(X - I) = \sqrt{n}(G^{-1} - I)$ has the limiting distribution of $-H$. (Expand $\sqrt{n}\{[I + (1/\sqrt{n})H]^{-1} - I\}$.) Since $G \xrightarrow{p} I$, $X \xrightarrow{p} I$, and $x_{ii} > 0$, $i = 1, \ldots, p$, with probability approaching 1. Then $Z^* = \sqrt{n}(X^* - \Gamma)$ has the limiting distribution of $\Gamma Z$. (Since $X \xrightarrow{p} I$, we have $X^* = \Gamma X \xrightarrow{p} \Gamma$ and $x_{1i}^* > 0$, $i = 1, \ldots, p$, with probability approaching 1.) The asymptotic variances and covariances (6) to (8) are obtained from (23) to (26). ■
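
The variance formulas of this proof follow from the linearized relations $u_{ii} - \lambda_iv_{ii} = d_i$ and $(\lambda_i - \lambda_j)h_{ij} = u_{ij} - v_{ij}\lambda_j$ together with the limiting variances $2\lambda_i^2/\eta$, $\lambda_i\lambda_j/\eta$, 2, and 1 of $u_{ii}$, $u_{ij}$, $v_{ii}$, and $v_{ij}$ and the independence of U and V. The display below is a worked sketch of that arithmetic, written with $\operatorname{Var}$ and $\operatorname{Cov}$ for the limiting moments; the last line uses the relation for $h_{ji}$, namely $(\lambda_j - \lambda_i)h_{ji} = u_{ij} - v_{ij}\lambda_i$.

```latex
\begin{aligned}
\operatorname{Var}(d_i)
  &= \operatorname{Var}(u_{ii}) + \lambda_i^2 \operatorname{Var}(v_{ii})
   = \frac{2\lambda_i^2}{\eta} + 2\lambda_i^2
   = \frac{2\lambda_i^2(1+\eta)}{\eta}, \\
(\lambda_i - \lambda_j)^2 \operatorname{Var}(h_{ij})
  &= \operatorname{Var}(u_{ij}) + \lambda_j^2 \operatorname{Var}(v_{ij})
   = \frac{\lambda_i\lambda_j}{\eta} + \lambda_j^2
   = \frac{\lambda_j(\lambda_i + \eta\lambda_j)}{\eta}, \\
-(\lambda_i - \lambda_j)^2 \operatorname{Cov}(h_{ij}, h_{ji})
  &= \operatorname{Var}(u_{ij}) + \lambda_i\lambda_j \operatorname{Var}(v_{ij})
   = \frac{\lambda_i\lambda_j(1+\eta)}{\eta}.
\end{aligned}
```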
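Anderson (1989b) has derived the limiting distribution of the characteristic roots and vectors of one sample covariance matrix in the metric of another with population roots of arbitrary multiplicities.

13.6.2. One Root of Higher Multiplicity

In Section 13.6.1 it was assumed that $mS^*$ and $nT^*$ were distributed independently according to $W(\Phi, m)$ and $W(\Sigma, n)$, respectively, and that the roots of $|\Phi - \lambda\Sigma| = 0$ were distinct. In this section we assume that the k larger roots are distinct and greater than the $p - k$ smaller roots, which are assumed equal. Let the diagonal matrix $\Lambda$ of characteristic roots be $\Lambda = \operatorname{diag}(\Lambda_1, \lambda^*I_{p-k})$, and let $\Gamma$ be a matrix satisfying

(27) $\Phi\Gamma = \Sigma\Gamma\Lambda, \qquad \Gamma'\Sigma\Gamma = I.$

Define S and T by (9) and diagonal L and G by (12). Then $S \xrightarrow{p} \Lambda$, $T \xrightarrow{p} I_p$, and $L \xrightarrow{p} \Lambda$. Partition S, T, L, and G as

(28) $S = \begin{pmatrix} S_{11} & S_{12}\\ S_{21} & S_{22} \end{pmatrix}, \quad T = \begin{pmatrix} T_{11} & T_{12}\\ T_{21} & T_{22} \end{pmatrix}, \quad L = \begin{pmatrix} L_1 & 0\\ 0 & L_2 \end{pmatrix}, \quad G = \begin{pmatrix} G_{11} & G_{12}\\ G_{21} & G_{22} \end{pmatrix},$

where $S_{11}$, $T_{11}$, $L_1$, and $G_{11}$ are $k \times k$. Then $G_{11} \xrightarrow{p} I_k$, $G_{12} \xrightarrow{p} 0$, and $G_{21} \xrightarrow{p} 0$, but $G_{22}$ does not have a probability limit. Instead $G_{22}'G_{22} \xrightarrow{p} I_{p-k}$. Let the singular value decomposition of $G_{22}$ be $EJF$, where E and F are orthogonal and J is diagonal. Let $C_2 = EF$.

The limiting distribution of $U = \sqrt{n}(S - \Lambda)$ and $V = \sqrt{n}(T - I)$ is normal with the covariance structure given above (12) with $\lambda_{k+1} = \cdots = \lambda_p = \lambda^*$. Define $D = \sqrt{n}(L - \Lambda)$, $H_{11} = \sqrt{n}(G_{11} - I)$, $H_{12} = \sqrt{n}G_{12}$, $H_{21} = \sqrt{n}G_{21}$, and $H_{22} = \sqrt{n}(G_{22} - C_2) = \sqrt{n}E(J - I_{p-k})F$. Then (13) and (15) are replaced by

(29) $\begin{pmatrix} \Lambda_1 & 0\\ 0 & \lambda^*I_{p-k} \end{pmatrix} + \dfrac{1}{\sqrt{n}} \begin{pmatrix} U_{11} & U_{12}\\ U_{21} & U_{22} \end{pmatrix}$

$= \begin{pmatrix} I_k + \frac{1}{\sqrt{n}}H_{11}' & \frac{1}{\sqrt{n}}H_{21}'\\ \frac{1}{\sqrt{n}}H_{12}' & C_2' + \frac{1}{\sqrt{n}}H_{22}' \end{pmatrix} \begin{pmatrix} \Lambda_1 + \frac{1}{\sqrt{n}}D_1 & 0\\ 0 & \lambda^*I_{p-k} + \frac{1}{\sqrt{n}}D_2 \end{pmatrix} \begin{pmatrix} I_k + \frac{1}{\sqrt{n}}H_{11} & \frac{1}{\sqrt{n}}H_{12}\\ \frac{1}{\sqrt{n}}H_{21} & C_2 + \frac{1}{\sqrt{n}}H_{22} \end{pmatrix}$

$= \begin{pmatrix} \Lambda_1 & 0\\ 0 & \lambda^*I_{p-k} \end{pmatrix} + \dfrac{1}{\sqrt{n}} \left[ \begin{pmatrix} D_1 & 0\\ 0 & C_2'D_2C_2 \end{pmatrix} + \begin{pmatrix} \Lambda_1H_{11} & \Lambda_1H_{12}\\ \lambda^*C_2'H_{21} & \lambda^*C_2'H_{22} \end{pmatrix} + \begin{pmatrix} H_{11}'\Lambda_1 & \lambda^*H_{21}'C_2\\ H_{12}'\Lambda_1 & \lambda^*H_{22}'C_2 \end{pmatrix} \right] + \dfrac{1}{n}M,$

and (14) and (16) are replaced by

(30) $\begin{pmatrix} I_k & 0\\ 0 & I_{p-k} \end{pmatrix} + \dfrac{1}{\sqrt{n}} \begin{pmatrix} V_{11} & V_{12}\\ V_{21} & V_{22} \end{pmatrix} = \begin{pmatrix} I_k & 0\\ 0 & I_{p-k} \end{pmatrix} + \dfrac{1}{\sqrt{n}} \begin{pmatrix} H_{11}' + H_{11} & H_{12} + H_{21}'C_2\\ H_{12}' + C_2'H_{21} & C_2'H_{22} + H_{22}'C_2 \end{pmatrix} + \dfrac{1}{n}N.$

If we neglect the terms of order $1/\sqrt{n}$ and $1/n$, instead of (19) we can write

(31) $U - V\Lambda = \begin{pmatrix} D_1 + \Lambda_1H_{11} - H_{11}\Lambda_1 & (\Lambda_1 - \lambda^*I)H_{12}\\ C_2'H_{21}(\lambda^*I - \Lambda_1) & C_2'D_2C_2 \end{pmatrix}.$

Then $v_{ii} = 2h_{ii}$, $i = 1, \ldots, k$; $u_{ii} - \lambda_iv_{ii} = d_i$, $i = 1, \ldots, k$; $u_{ij} - v_{ij}\lambda_j = (\lambda_i - \lambda_j)h_{ij}$, $i \ne j$, $i, j = 1, \ldots, k$; $U_{22} - \lambda^*V_{22} = C_2'D_2C_2$; $C_2(U_{21} - V_{21}\Lambda_1) = H_{21}(\lambda^*I - \Lambda_1)$; and $U_{12} - \lambda^*V_{12} = (\Lambda_1 - \lambda^*I)H_{12}$. The limiting distribution of $U_{22} - \lambda^*V_{22}$ is normal with mean 0; $\mathcal{E}(u_{ii} - \lambda^*v_{ii})^2 = 2\lambda^{*2}(1 + \eta)/\eta$, $i = k + 1, \ldots, p$; and $\mathcal{E}(u_{ij} - \lambda^*v_{ij})^2 = \lambda^{*2}(1 + \eta)/\eta$, $i \ne j$, $i, j = k + 1, \ldots, p$. The limiting distribution of $D_2$ and $C_2$ is the distribution of $D_2$ and $C_2$ defined by $U_{22} - \lambda^*V_{22} = C_2'D_2C_2$, where $(1/\lambda^*)(U_{22} - \lambda^*V_{22})$ has the density of (25) of Section 13.3.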
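
13.7. ASYMPTOTIC DISTRIBUTION IN A REGRESSION MODEL

13.7.1. Both Sets of Variates Stochastic

The sample canonical correlations $l_1, \ldots, l_{p_1}$ and vectors $\hat\alpha_1, \ldots, \hat\alpha_{p_1}$ and $\hat\gamma_1, \ldots, \hat\gamma_{p_2}$ are defined in Section 12.3. The set $\hat\gamma_1, \ldots, \hat\gamma_{p_2}$ and $l_1^2, \ldots, l_{p_1}^2$ are defined by

(1) $S_{21}S_{11}^{-1}S_{12}\gamma = S_{22}\gamma l^2, \qquad \gamma'S_{22}\gamma = 1.$

The asymptotic distribution of these quantities was given by Anderson (1999a) when $X = (X^{(1)\prime}, X^{(2)\prime})'$ has a normal distribution and also when $X^{(1)}$ is normally distributed with a linear function of nonstochastic $X^{(2)}$ as expected value. We shall now find the asymptotic distribution when X has a normal distribution. The model in regression form is

(2) $X^{(1)} = \beta X^{(2)} + Z,$

where $X^{(2)}$ and Z are independently normally distributed with expected values $\mathcal{E}X^{(2)} = 0$ and $\mathcal{E}Z = 0$ and covariances $\mathcal{E}X^{(2)}X^{(2)\prime} = \Sigma_{22}$, $\mathcal{E}ZZ' = \Sigma_{ZZ}$ ($\mathcal{E}X^{(2)}Z' = 0$). Then $\mathcal{E}X^{(1)} = 0$ and $\mathcal{E}X^{(1)}X^{(1)\prime} = \Sigma_{11} = \Sigma_{ZZ} + \beta\Sigma_{22}\beta'$, and $\mathcal{E}X^{(1)}X^{(2)\prime} = \Sigma_{12} = \beta\Sigma_{22}$. Inference is based on a sample of X of n observations.

First we transform to canonical variables $U = A'X^{(1)}$, $V = \Gamma'X^{(2)}$, and $W = A'Z$. Then (2) is transformed to

(3) $U = \Theta V + W,$

where $\mathcal{E}UU' = \Sigma_{UU} = I_{p_1}$, $\mathcal{E}VV' = \Sigma_{VV} = I_{p_2}$, $\mathcal{E}UV' = \Sigma_{UV} = (\Lambda, 0)$, $\mathcal{E}WW' = \Sigma_{WW} = I - \Lambda^2$, and $\mathcal{E}VW' = 0$. [See (33) to (37) and (45) of Section 12.2.] Let the sample covariance matrices be $S_{UU} = A'S_{11}A$, $S_{UV} = A'S_{12}\Gamma$, and $S_{VV} = \Gamma'S_{22}\Gamma$. Let the sample vectors constitute $H = \Gamma^{-1}\hat\Gamma = \Gamma^{-1}(\hat\gamma_1, \ldots, \hat\gamma_{p_2})$. Then H satisfies

(4) $S_{VU}S_{UU}^{-1}S_{UV}H = S_{VV}H\hat\Lambda_+^2, \qquad H'S_{VV}H = I_{p_2},$

where $\hat\Lambda_+ = \operatorname{diag}(l_1, \ldots, l_{p_1}, 0, \ldots, 0)$; if $p_1 < p_2$, there are $p_2 - p_1$ 0's in $\hat\Lambda_+$.

We have $S_{UU} \xrightarrow{p} I_{p_1}$, $S_{VV} \xrightarrow{p} I_{p_2}$, $S_{UV} \xrightarrow{p} (\Lambda, 0)$. Then $\hat\Lambda = \operatorname{diag}(l_1, \ldots, l_{p_1}) \xrightarrow{p} \Lambda$. Let

(5) $H = (H_1, H_2) = \begin{pmatrix} H_{11} & H_{12}\\ H_{21} & H_{22} \end{pmatrix},$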
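where $H_{11}$ is $p_1 \times p_1$ and $H_{22}$ is $(p_2 - p_1) \times (p_2 - p_1)$. The first $p_1$ columns of (4) are

(6) $S_{VU}S_{UU}^{-1}S_{UV}H_1 = S_{VV}H_1\hat\Lambda^2;$

the last $p_2 - p_1$ columns of (4) are $S_{UV}H_2 = 0$. Then $H_{11} \xrightarrow{p} I$, $H_{12} \xrightarrow{p} 0$, and $H_{21} \xrightarrow{p} 0$, but the probability limit of (4) only implies $H_{22}'H_{22} \xrightarrow{p} I$. Let the singular value decomposition of $H_{22}$ be $H_{22} = EJF$.

Define $S_{UU}^* = \sqrt{n}(S_{UU} - I_{p_1})$, $S_{VV}^* = \sqrt{n}(S_{VV} - I_{p_2})$, $S_{UV}^* = \sqrt{n}[S_{UV} - (\Lambda, 0)]$, $H_1^* = \sqrt{n}(H_1 - I_{(p_1)})$, and $\Lambda^* = \sqrt{n}(\hat\Lambda - \Lambda)$, where $I_{(p_1)} = (I_{p_1}, 0)'$. Then expansion of (6) yields

(7) $\left[(\Lambda, 0)' + \dfrac{1}{\sqrt{n}}S_{VU}^*\right] \left[I + \dfrac{1}{\sqrt{n}}S_{UU}^*\right]^{-1} \left[(\Lambda, 0) + \dfrac{1}{\sqrt{n}}S_{UV}^*\right] \left[I_{(p_1)} + \dfrac{1}{\sqrt{n}}H_1^*\right] = \left[I + \dfrac{1}{\sqrt{n}}S_{VV}^*\right] \left[I_{(p_1)} + \dfrac{1}{\sqrt{n}}H_1^*\right] \left[\Lambda + \dfrac{1}{\sqrt{n}}\Lambda^*\right]^2.$

From (7) we obtain

(8) $S_{VU}^*\Lambda - (\Lambda, 0)'S_{UU}^*\Lambda + (\Lambda, 0)'S_{UV}^*I_{(p_1)} + (\Lambda, 0)'(\Lambda, 0)H_1^* = S_{VV}^*I_{(p_1)}\Lambda^2 + H_1^*\Lambda^2 + 2I_{(p_1)}\Lambda\Lambda^* + o_p(1).$

From $H_1'S_{VV}H_1 = I_{p_1}$ we derive

(9) $H_{11}^* + H_{11}^{*\prime} = -S_{VV}^{*11} + o_p(1).$

In terms of partitioned matrices (8) is

(10) $\begin{pmatrix} S_{VU}^{*11}\Lambda + \Lambda S_{UV}^{*11} - \Lambda S_{UU}^*\Lambda - S_{VV}^{*11}\Lambda^2\\ S_{VU}^{*21}\Lambda - S_{VV}^{*21}\Lambda^2 \end{pmatrix} = \begin{pmatrix} 2\Lambda\Lambda^* + H_{11}^*\Lambda^2 - \Lambda^2H_{11}^*\\ H_{21}^*\Lambda^2 \end{pmatrix} + o_p(1).$

The lower submatrix equation [$(p_2 - p_1) \times p_1$] of (10) is

(11) $H_{21}^*\Lambda = S_{VU}^{*21} - S_{VV}^{*21}\Lambda + o_p(1) = S_{VW}^{*21} + o_p(1) = (S_{WV}^{*12})' + o_p(1).$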
A diagonal element of the upper submatrix equation of (10) is

(12) $\lambda_i^* = \dfrac{1}{\sqrt{n}} \sum_{\alpha=1}^n \left[ (u_{i\alpha}v_{i\alpha} - \lambda_i) - \tfrac12\lambda_i(u_{i\alpha}^2 - 1) - \tfrac12\lambda_i(v_{i\alpha}^2 - 1) \right] + o_p(1).$

The right-hand side of (12) is the expansion of the sample correlation coefficient of $u_{i\alpha}$ and $v_{i\alpha}$. See Section 4.2.3. The limiting distribution of $\lambda_i^*$ is $N[0, (1 - \lambda_i^2)^2]$.
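The $(i, j)$th component of $H_{11}^*$ in (10) is given by

(13) $(\lambda_j^2 - \lambda_i^2)h_{ij}^* = \dfrac{1}{\sqrt{n}} \sum_{\alpha=1}^n \left( v_{i\alpha}u_{j\alpha}\lambda_j + \lambda_iu_{i\alpha}v_{j\alpha} - \lambda_i\lambda_ju_{i\alpha}u_{j\alpha} - v_{i\alpha}v_{j\alpha}\lambda_j^2 \right) + o_p(1), \qquad i \ne j, \quad i, j = 1, \ldots, p_1.$

The asymptotic covariance matrix of $(\lambda_j^2 - \lambda_i^2)h_{ij}^*$ and $(\lambda_i^2 - \lambda_j^2)h_{ji}^*$ is

(14) $\begin{pmatrix} (1 - \lambda_j^2)(\lambda_i^2 + \lambda_j^2 - 2\lambda_i^2\lambda_j^2) & (1 - \lambda_i^2)(1 - \lambda_j^2)(\lambda_i^2 + \lambda_j^2)\\ (1 - \lambda_i^2)(1 - \lambda_j^2)(\lambda_i^2 + \lambda_j^2) & (1 - \lambda_i^2)(\lambda_i^2 + \lambda_j^2 - 2\lambda_i^2\lambda_j^2) \end{pmatrix}.$

The pair $(h_{ij}^*, h_{ji}^*)$ is uncorrelated with other pairs.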

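Suppose $p_1 = p_2$. Then $\hat\Gamma^* = \sqrt{n}(\hat\Gamma - \Gamma) = \Gamma H^*$. Let $\Gamma = (\gamma_1, \ldots, \gamma_p)$, $\hat\Gamma = (\hat\gamma_1, \ldots, \hat\gamma_p)$. Then $\hat\gamma_j^* = \sum_{i=1}^p \gamma_ih_{ij}^*$, where $h_{ij}^*$, $i \ne j$, is obtained from (13) and $h_{jj}^*$ from (9). We obtain

(15) $n\mathcal{E}(\hat\gamma_j - \gamma_j)(\hat\gamma_j - \gamma_j)' \to \tfrac12\gamma_j\gamma_j' + (1 - \lambda_j^2) \sum_{i \ne j} \dfrac{\lambda_i^2 + \lambda_j^2 - 2\lambda_i^2\lambda_j^2}{(\lambda_j^2 - \lambda_i^2)^2}\, \gamma_i\gamma_i',$

(16) $n\mathcal{E}(\hat\gamma_j - \gamma_j)(\hat\gamma_i - \gamma_i)' \to -\dfrac{(1 - \lambda_i^2)(1 - \lambda_j^2)(\lambda_i^2 + \lambda_j^2)}{(\lambda_j^2 - \lambda_i^2)^2}\, \gamma_i\gamma_j', \qquad i \ne j.$

Anderson (1999a) has also given the asymptotic covariances of $\hat\alpha_j$ and of $\hat\gamma_i$ and $\hat\alpha_j$. Note that $h_{ij}^*$ depends linearly on $(u_{i\alpha}, v_{i\alpha})$ and that the pairs $(u_{i\alpha}, v_{i\alpha})$ and $(u_{j\alpha}, v_{j\alpha})$, $i \ne j$, are uncorrelated. The covariances (14) do not depend on $(U, V)$ being normal.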

Now suppose that the rank of $\Sigma_{12}$ is $k < p_1$. Define $H_1 = (H_{11}', H_{21}')'$ as the first k columns of H satisfying (4), and define $\hat\Lambda_1 = \operatorname{diag}(l_1, \ldots, l_k)$. Then $H_1$ satisfies (6), and $H_1^*$ satisfies (8), (9), (10), and (11). Then $\lambda_i^*$ is given by (12) for $i = 1, \ldots, k$.

The last $p_1 - k$ columns of (4) are

07) m 十？ ， 2 卜(1), 

Hence 

〔 18) ' 邱 =-5^ 2 C 2+0p (l) = -S^pC 2 +o p (l). 

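13.7.2. One Set of Variates Stochastic and the Other Set Nonstochastic

Now consider the case that $X^{(2)}$ in (2) is nonstochastic, where $\mathcal{E}Z_\alpha = 0$ and $\mathcal{E}Z_\alpha Z_\alpha' = \Sigma_{ZZ}$. We observe $X = x_1, \ldots, x_n$. We assume

(19) $S_{22} = \dfrac{1}{n} \sum_{\alpha=1}^n x_\alpha^{(2)}x_\alpha^{(2)\prime} \to \Sigma_{22}$

and $\Sigma_{22}$ is nonsingular. Then

(20) $S_{11} = \dfrac{1}{n} \sum_{\alpha=1}^n x_\alpha^{(1)}x_\alpha^{(1)\prime} = \beta S_{22}\beta' + S_{Z2}\beta' + \beta S_{2Z} + S_{ZZ} \xrightarrow{p} \beta\Sigma_{22}\beta' + \Sigma_{ZZ},$

(21) $S_{12} = \dfrac{1}{n} \sum_{\alpha=1}^n x_\alpha^{(1)}x_\alpha^{(2)\prime} = \beta S_{22} + S_{Z2} \xrightarrow{p} \beta\Sigma_{22}.$

Define $\lambda$, $\alpha$, and $\gamma$ by solutions to

(22) $\begin{pmatrix} -\lambda(\Sigma_{ZZ} + \beta S_{22}\beta') & \beta S_{22}\\ S_{22}\beta' & -\lambda S_{22} \end{pmatrix} \begin{pmatrix} \alpha\\ \gamma \end{pmatrix} = 0,$

(23) $\alpha'(\Sigma_{ZZ} + \beta S_{22}\beta')\alpha = 1, \qquad \gamma'S_{22}\gamma = 1.$

We shall first assume $p_1 = p_2$ and $\lambda_1 > \cdots > \lambda_{p_1} > 0$. Then (22) and (23) and $\alpha_{1i} > 0$ define

(24) $\operatorname{diag}(\lambda_1, \ldots, \lambda_{p_1}) = \Lambda_n, \qquad (\alpha_1, \ldots, \alpha_{p_1}) = A_n, \qquad (\gamma_1, \ldots, \gamma_{p_1}) = \Gamma_n.$

Let $u_\alpha = A_n'x_\alpha^{(1)}$, $v_\alpha = \Gamma_n'x_\alpha^{(2)}$, $\alpha = 1, \ldots, n$, $W = A_n'Z$, $\Theta = A_n'\beta(\Gamma_n')^{-1} = \Lambda_n$, $H = \Gamma_n^{-1}\hat\Gamma$. Then H and $\hat\Lambda$ satisfy (4). Then $S_{VV} = I$,

(25) $S_{UV} = \Theta S_{VV} + S_{WV} = \Theta + S_{WV} \xrightarrow{p} \Theta,$

(26) $S_{UU} = \Theta S_{VV}\Theta' + \Theta S_{VW} + S_{WV}\Theta' + S_{WW} \xrightarrow{p} I.$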
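Then (4) can be written

(27) $(\Lambda + S_{VW})(\Lambda^2 + \Lambda S_{VW} + S_{WV}\Lambda + S_{WW})^{-1}(\Lambda + S_{WV})H = H\hat\Lambda^2.$

Note that $S_{VW} \xrightarrow{p} 0$, $S_{WW} \xrightarrow{p} I - \Lambda^2$, and hence $H \xrightarrow{p} I$, $\hat\Lambda \xrightarrow{p} \Lambda$.

Let $S_{VW}^* = \sqrt{n}S_{VW}$, $S_{WW}^* = \sqrt{n}[S_{WW} - (I - \Lambda^2)]$. Then (27) leads to

(28) $(I - \Lambda^2)S_{VW}^*\Lambda + \Lambda S_{WV}^*(I - \Lambda^2) - \Lambda S_{WW}^*\Lambda + \Lambda^2H^* = H^*\Lambda^2 + 2\Lambda\Lambda^* + o_p(1).$

A diagonal term of (28) gives

(29) $\lambda_i^* = (1 - \lambda_i^2) \dfrac{1}{\sqrt{n}} \sum_{\alpha=1}^n v_{i\alpha}w_{i\alpha} - \tfrac12\lambda_i \dfrac{1}{\sqrt{n}} \sum_{\alpha=1}^n \left[ w_{i\alpha}^2 - (1 - \lambda_i^2) \right] + o_p(1).$

Since

(30) $\dfrac{1}{n} \sum_{\alpha=1}^n v_{i\alpha}^2\, \mathcal{E}w_{i\alpha}^2 = 1 - \lambda_i^2,$

(31) $\mathcal{E}\left[ w_{i\alpha}^2 - (1 - \lambda_i^2) \right]^2 = 2(1 - \lambda_i^2)^2$

under the assumption that W is normally distributed, the limiting distribution of $\sqrt{n}(l_i - \lambda_i)$ is $N[0, (1 - \lambda_i^2)^2(1 - \tfrac12\lambda_i^2)]$. Note that this variance is smaller than in the case of $X^{(2)}$ stochastic.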
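
The comparison of the two limiting variances can be made explicit. In the nonstochastic case the variance is built from two uncorrelated pieces: $(1 - \lambda_i^2)^2$ times the variance $1 - \lambda_i^2$ of the cross-product term, plus $(\frac12\lambda_i)^2$ times the variance $2(1 - \lambda_i^2)^2$ of the squared term. The sketch below, with hypothetical function names and example values of $\lambda$, evaluates both cases exactly.

```python
from fractions import Fraction

def var_stochastic(lam):
    """Limiting variance when X^(2) is stochastic: (1 - lam^2)^2."""
    return (1 - lam ** 2) ** 2

def var_nonstochastic(lam):
    """Limiting variance when X^(2) is nonstochastic, assembled from its two pieces."""
    w_var = 1 - lam ** 2                            # variance of w_i
    cross_term = (1 - lam ** 2) ** 2 * w_var        # from (1 - lam^2) * sum of v w
    square_term = (lam / 2) ** 2 * 2 * w_var ** 2   # from (lam / 2) * sum of [w^2 - (1 - lam^2)]
    return cross_term + square_term

for lam in (Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)):
    print(lam, var_stochastic(lam), var_nonstochastic(lam))
```

The ratio of the second variance to the first is $1 - \frac12\lambda^2$, so the reduction is largest when the canonical correlation is large.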

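From (28) we find

(32) $(\lambda_j^2 - \lambda_i^2)h_{ij}^* = \dfrac{1}{\sqrt{n}} \sum_{\alpha=1}^n \left[ (1 - \lambda_i^2)v_{i\alpha}w_{j\alpha}\lambda_j + \lambda_iw_{i\alpha}v_{j\alpha}(1 - \lambda_j^2) - \lambda_iw_{i\alpha}w_{j\alpha}\lambda_j \right] + o_p(1).$

Then

(33) $(\lambda_j^2 - \lambda_i^2)^2\, \mathcal{E}(h_{ij}^*)^2 \to (1 - \lambda_i^2)(1 - \lambda_j^2)(\lambda_i^2 + \lambda_j^2 - \lambda_i^2\lambda_j^2).$

The equation $H'S_{VV}H = I$ implies $H'H = I$, leading to $H^* = -H^{*\prime} + o_p(1)$, that is, $h_{ij}^* = -h_{ji}^* + o_p(1)$.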

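Now suppose that the rank of $\beta$ is $k < p_1 = p_2$. Then $\Lambda = \operatorname{diag}(\Lambda_1, 0)$, where $\Lambda_1 = \operatorname{diag}(\lambda_1, \ldots, \lambda_k)$. Let $\Gamma = (\Gamma_1, \Gamma_2)$, where $\Gamma_1$ has k columns and $\Gamma_2$ has $p_1 - k$ columns. Define the partition (5) to be made into k and $p_1 - k$ rows and columns. The probability limit of (4) implies $H_{11} \xrightarrow{p} I_k$, $H_{12} \xrightarrow{p} 0$, $H_{21} \xrightarrow{p} 0$, and $H_{22}'H_{22} \xrightarrow{p} I$. Let the singular value decomposition of $H_{22}$ be $EJF$, where J is a diagonal matrix of order $p_1 - k$ and E and F are orthogonal matrices of order $p_1 - k$. Define $C_2 = EF$. The expansion of (4) in terms of $S_{VW}^* = \sqrt{n}S_{VW}$, $S_{WW}^* = \sqrt{n}[S_{WW} - (I - \Lambda^2)]$, $H_{11}^* = \sqrt{n}(H_{11} - I)$, $H_{12}^* = \sqrt{n}H_{12}$, $H_{21}^* = \sqrt{n}H_{21}$, and $H_{22}^* = \sqrt{n}(H_{22} - C_2) = \sqrt{n}E(J - I)F$ yields

(34) $\begin{pmatrix} \Lambda_1S_{WV}^{*11}(I - \Lambda_1^2) + (I - \Lambda_1^2)S_{VW}^{*11}\Lambda_1 - \Lambda_1S_{WW}^{*11}\Lambda_1 + \Lambda_1^2H_{11}^* & \Lambda_1S_{WV}^{*12}C_2 + \Lambda_1^2H_{12}^*\\ S_{VW}^{*21}\Lambda_1 & 0 \end{pmatrix} = \begin{pmatrix} H_{11}^*\Lambda_1^2 + 2\Lambda_1\Lambda_1^* & 0\\ H_{21}^*\Lambda_1^2 & 0 \end{pmatrix} + o_p(1).$

The ith diagonal term of (34) is (29) for $i = 1, \ldots, k$. The $i, j$th element of the upper left-hand submatrix is (32) for $i \ne j$ and $i, j = 1, \ldots, k$. Two other submatrix equations of (34) are

(35) $\Lambda_1H_{12}^* = -S_{WV}^{*12}C_2 + o_p(1),$

(36) $H_{21}^*\Lambda_1 = S_{VW}^{*21} + o_p(1).$

The equation $I = H'S_{VV}H = H'H$ yields

(37) $\begin{pmatrix} H_{11}^* + H_{11}^{*\prime} & H_{12}^* + H_{21}^{*\prime}C_2\\ H_{12}^{*\prime} + C_2'H_{21}^* & H_{22}^{*\prime}C_2 + C_2'H_{22}^* \end{pmatrix} = 0 + o_p(1).$

The off-diagonal submatrices of (37) agree with (35) and (36).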



13.7.3. Reduced Rank Regression Estimator 

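When the rank of $\beta$ is specified to be $k$ ($< p_1$), the maximum likelihood estimator of $\beta$ is

(38) $\hat\beta_k = S_{12}\hat\Gamma_1\hat\Gamma_1'.$

See Section 12.7. In terms of (3) the reduced-rank regression estimator of $\Theta$ is

(39) $\hat\Theta_k = S_{UV}H_1H_1'.$

Suppose $X^{(2)}$ is stochastic and $\Theta = \operatorname{diag}(\Theta_1, 0) = \operatorname{diag}(\Lambda_1, 0)$. We define $\hat\Theta_k^* = \sqrt{n}(\hat\Theta_k - \Theta)$, $H_1^* = \sqrt{n}(H_1 - I_{(k)})$, $S_{UV}^* = \sqrt{n}(S_{UV} - \Lambda)$, and $S_{VV}^* = \sqrt{n}(S_{VV} - I)$. From $H_1'S_{VV}H_1 = I$ we find $H_{11}^* + H_{11}^{*\prime} = -S_{VV}^{*11} + o_p(1)$.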
From (39) and (9) we obtain

s!,]) + \ l {Hr, + Hr ] , ) 


(40) 




o* II 
^wy 

o* 21 

^\yy 


c*12 

°wv 


21 


o 


+ 0 / 1 ) 




We can compare this with the maximum likelihood estimator unrestricted by
a rank condition, $\hat\Theta = S_{uv}S_{vv}^{-1}$. Then


(41) 


0 * = ^(&-&)=(S Uy -&S Vy )Sy 

e* 11 o* 12 


^wy ^ ⑴ 


o* 21 o* 22 

°wv ^wy 


乂⑴， 


since $S_{vv} \to_p I$. The effect of the rank restriction is to replace the lower
right-hand submatrix of $S^*_{wv}$ by 0 (the parameter value).

Since = we have vec S% v = {\/\!n )E^|(V ； ® V^). 

Because V a and W a are independent, 

(42) ^ vec S^ y {\ec S* wv )' 

=® A 2 ) = diag(/-A 2 ,...,/ - A 2 ), 

where $\Lambda = \operatorname{diag}(\Lambda_1, 0)$ and $I - \Lambda^2 = \operatorname{diag}(I - \Lambda_1^2, I)$. On the other hand

W^\. 


(43) 


vec 


♦ vec +E 




o J 


C 2 )， 


心⑴ 


7T 


E 


' ® 

K 

F a ( 2 ) ® 

Vji)) 
l o J 


+ o p {l). 


where F a = (F a ⑴ ’ ， F a ( 2 )’)' and W a = (W^ ) \W^ 2) ')'. Then 
(44) 

0 

o i 


i vec 0^ (vec ) ； 


p\~ k 




Wo 

0 0 


- diag (/ P1 - A 2 ,/ a . - . I k - A^O) 

where there are k blocks of / pi ~ A 2 and p 广 k blocks of diag (/ A . - A : 1? 0) 
In the original coordinate system

(45) vec(4 - P) - vec[( A') _l (© k - 0 )叫 

=[r ® - 0) 

- (r i ,r 2 )®2 zz (A I ,A 2 )(/-A 2 ) _1 ]vec(©-©). 

From (44) and (45) we obtain 

(46) ^ vec n(^B k — P)[vec(B fc -p)] f 

^ |(r,ri) ® x zz a (/ p - a 2 ) 

+ r 2 r; 2 ZZ a 1 ( 4 - a 2 丨） A r | 

=[ru 0 s zz ] + [r 2 r; 0 s zz 〜(’灸 一 /V') a\s zz 

— <s> s zz - (r 2 r; 0 s zz a 2 a’ 2 s zz ) ， 

If we define O = ^ YX T\ - A, A ( (/- A ^)" 1 and II = T h then p = ftir. 
We have 

(47) = S zz - S zz A 2 A 、 2 zZ ， 

(48) 

Thus (46) can be written 

(49) /vec 4* (vec Bf .)'® S zz - [ S A - n( IT 5^ Hr 1 1 叫 

® 2 zz -ft(ft , 2zin)~ 1 n , . 

Theorem 13.7.1. Let (X i]) \ X (2), ) ; , a= 1,..., /i, be observations on the 
random vector x n with mean 0 and covariance matrix S. Let P = ^ 12 ^ 22 l - Let 
the columns of f\ satisfy (1) and y u > 0, Suppose that X (1) — PX" (2) =Z is 
independent of X [2) . Then the limiting distribution of vec B* = v^Tvec ( 毛 一 p )， 
with B k — is normal with mean 0 and covariance matrix (46) or (49), 

Note that p = Iin ; = for arbitrary nonsingular M; how¬ 

ever, (47) and (48) are invariant with respect to the transforir ation ft ftilf 
and n — flAT" 1 , Thus (49) holds for any factorization P = Iin ; , 
The limiting distribution of only depends on V / nS Z 2 5^" 2 t - 

{A , )^ l S% v SyyT , and hence holds under the same conditions as the asymp- 
totic normality of the least squares estimator B. 

Now suppose that = x^\ o：= 1 ， … ， w，is nonstochastic and that (19) 
holds. The model is (2); in the transformed coordinates [t/= A 、 x ( 1 ) ， v a - 
K n Z, 0 - A^pCr ,；) -1 = A] the model is (3). satis¬ 

fies (34) and (37)，Again (39) holds. Further, (42) and (43) hold with V a = v a 
nonstochastic. 

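Corollary 13.7.1. Let $x_1^{(2)}, \ldots, x_n^{(2)}$ be a set of vectors such that (19) holds.
Let $x_\alpha^{(1)} = \beta x_\alpha^{(2)} + z_\alpha$, $\alpha = 1, \ldots, n$, where $z_\alpha$ is an observation on a random
vector $Z$ with $\mathcal{E}Z = 0$ and $\mathcal{E}ZZ' = \Sigma_{ZZ}$. Suppose $\beta$ has rank $k$. Then the
limiting distribution of $\sqrt{n}\,\operatorname{vec}(\hat\beta_k - \beta)$ is normal with mean 0 and covariance
(46) or (49).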

13.8. ELLIPTICALLY CONTOURED DISTRIBUTIONS 
13.8.1. Observations Elliptically Contoured 

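Let $x_1, \ldots, x_N$ be $N$ observations on a random vector $X$ with density

(1) $|\Psi|^{-1/2}\, g[(x - \nu)'\Psi^{-1}(x - \nu)]$,

where $\Psi$ is a positive definite matrix, $R^2 = (x - \nu)'\Psi^{-1}(x - \nu)$, and
$\mathcal{E}R^4 < \infty$. Define $\kappa = p\,\mathcal{E}R^4/[(\mathcal{E}R^2)^2(p + 2)] - 1$. Then $\mathcal{E}X = \nu = \mu$ and
$\mathcal{E}(X - \nu)(X - \nu)' = (\mathcal{E}R^2/p)\Psi = \Sigma$. Define $\bar{x}$ and $S$ as the sample mean
and covariance matrix. Define the orthogonal matrices $\beta$ and $B$ and the
diagonal matrices $\Lambda$ and $L$ by

(2) $\Sigma = \beta\Lambda\beta'$, $S = BLB'$,

$\lambda_1 > \cdots > \lambda_p$, $l_1 > \cdots > l_p$, $\beta_{1i} \ge 0$, $b_{1i} \ge 0$, $i = 1, \ldots, p$. As in Section 13.5.1,
define $T = \beta'S\beta = YLY'$, where $Y = \beta'B$ is orthogonal and $y_{ii} \ge 0$. Then
$\mathcal{E}T = \beta'\Sigma\beta = \Lambda$.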

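The limiting covariances of $\sqrt{N}\operatorname{vec}(S - \Sigma)$ and $\sqrt{N}\operatorname{vec}(T - \Lambda)$ are

(3) $\lim_{N\to\infty} N\,\mathcal{E}\operatorname{vec}(S - \Sigma)[\operatorname{vec}(S - \Sigma)]' = (\kappa + 1)(I_{p^2} + K_{pp})(\Sigma \otimes \Sigma) + \kappa\operatorname{vec}\Sigma\,(\operatorname{vec}\Sigma)'$,

(4) $\lim_{N\to\infty} N\,\mathcal{E}\operatorname{vec}(T - \Lambda)[\operatorname{vec}(T - \Lambda)]' = (\kappa + 1)(I_{p^2} + K_{pp})(\Lambda \otimes \Lambda) + \kappa\operatorname{vec}\Lambda\,(\operatorname{vec}\Lambda)'$.

In terms of components $\mathcal{E}t_{ij} = \lambda_i\delta_{ij}$ and

(5) $\lim_{N\to\infty} N\,\mathcal{E}(t_{ij} - \lambda_i\delta_{ij})(t_{kl} - \lambda_k\delta_{kl}) = (\kappa + 1)(\lambda_i\lambda_j\delta_{ik}\delta_{jl} + \lambda_i\lambda_j\delta_{il}\delta_{jk}) + \kappa\lambda_i\lambda_k\delta_{ij}\delta_{kl}$.

Let $\sqrt{N}(T - \Lambda) = U$, $\sqrt{N}(L - \Lambda) = D$, and $\sqrt{N}(Y - I_p) = W$. The set
$u_{11}, \ldots, u_{pp}$ are asymptotically independent of the set $(u_{12}, \ldots, u_{p-1,p})$; the
$u_{ij}$, $i < j$, are mutually independent with variances $(\kappa + 1)\lambda_i\lambda_j$;
the variance of $u_{ii} = d_i$ converges to $(3\kappa + 2)\lambda_i^2$; the covariance of $u_{ii} = d_i$
and $u_{kk} = d_k$, $i \ne k$, converges to $\kappa\lambda_i\lambda_k$. The limiting distribution of $w_{ij}$, $i \ne j$,
is the limiting distribution of $u_{ij}/(\lambda_j - \lambda_i)$. Thus the $w_{ij}$, $i < j$, are asymptotically
mutually independent with $\mathcal{E}w_{ij}^2 = (\kappa + 1)\lambda_i\lambda_j/(\lambda_i - \lambda_j)^2$. These variances
and covariances for the normal case hold for $\kappa = 0$.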
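
As a small numerical illustration of the role of $\kappa$, the sketch below evaluates $\kappa = p\,\mathcal{E}R^4/[(p + 2)(\mathcal{E}R^2)^2] - 1$ from the first two moments of $R^2$ and then the limiting variance $(2 + 3\kappa)\lambda_i^2$ and covariance $\kappa\lambda_i\lambda_k$ of the roots. The function names are hypothetical, and the two families used (the normal, where $R^2$ is $\chi^2_p$, and a multivariate $t$ with more than 4 degrees of freedom, where $R^2/p$ has an $F$ distribution) are illustrative assumptions, not cases worked in the text.

```python
def kurtosis_parameter(p, er2, er4):
    """kappa = p * E[R^4] / ((p + 2) * (E[R^2])^2) - 1."""
    return p * er4 / ((p + 2) * er2 ** 2) - 1.0


def normal_radial_moments(p):
    # Assumption: X is normal, so R^2 is chi-square with p degrees of freedom.
    return float(p), float(p * (p + 2))


def student_t_radial_moments(p, df):
    # Assumption: X is multivariate t with df > 4, so R^2 / p is F(p, df).
    er2 = p * df / (df - 2.0)
    er4 = p * (p + 2) * df ** 2 / ((df - 2.0) * (df - 4.0))
    return er2, er4


def limiting_root_moments(lam_i, lam_k, kappa):
    """Limiting variance of d_i and covariance of d_i and d_k (i != k)."""
    return (2.0 + 3.0 * kappa) * lam_i ** 2, kappa * lam_i * lam_k


kappa_normal = kurtosis_parameter(3, *normal_radial_moments(3))
kappa_t10 = kurtosis_parameter(3, *student_t_radial_moments(3, 10.0))
var_d1, cov_d1_d2 = limiting_root_moments(2.0, 1.0, kappa_t10)
```

For the normal case the sketch returns $\kappa = 0$, in agreement with the last sentence above; heavier tails give $\kappa > 0$ and inflate the limiting variances.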

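Theorem 13.8.1. Define diagonal $\Lambda$ and $L$ and orthogonal $\beta$ and $B$ by (2),
$\lambda_1 > \cdots > \lambda_p$, $l_1 > \cdots > l_p$, $\beta_{1i} \ge 0$, $b_{1i} \ge 0$, $i = 1, \ldots, p$. Define $G = \sqrt{N}(B - \beta)$
and diagonal $D = \sqrt{N}(L - \Lambda)$. Then the limiting distribution of $G$ and $D$ is
normal with $G$ and $D$ independent. The variance of $d_i$ is $(2 + 3\kappa)\lambda_i^2$, and the
covariance of $d_i$ and $d_k$ is $\kappa\lambda_i\lambda_k$. The covariance of $g_i$ is

(6) $\mathcal{C}(g_i) = (1 + \kappa)\sum_{k=1,\,k\ne i}^{p} \dfrac{\lambda_i\lambda_k}{(\lambda_i - \lambda_k)^2}\,\beta_k\beta_k'$.

The covariance matrix of $g_i$ and $g_j$ is

(7) $\mathcal{C}(g_i, g_j') = -(1 + \kappa)\dfrac{\lambda_i\lambda_j}{(\lambda_i - \lambda_j)^2}\,\beta_j\beta_i'$, $i \ne j$.

Proof. The proof is the same as for Theorem 13.5.1 except that (4) is used
instead of (4) with $\kappa = 0$. ■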

In Section 11.7.3 we used the asymptotic distribution of the smallest $q$
sample roots when the smallest $q$ population roots are equal. Let $\Lambda = \operatorname{diag}(\Lambda_1, \lambda^* I_q)$,
where the diagonal elements of $\Lambda_1$ (diagonal) are different
and are larger than $\lambda^*$. As before, let $U = \sqrt{N}(T - \Lambda)$, and let $U_{22}$ be the
lower right-hand $q \times q$ submatrix of $U$. Let $D_2$ and $Y_{22}$ be the lower
right-hand $q \times q$ submatrices of $D$ and $Y$. It was shown in Section 13.5.2 that
$U_{22} = Y_{22}D_2Y_{22}' + o_p(1)$.
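The criterion for testing the null hypothesis $\lambda_{p-q+1} = \cdots = \lambda_p$ is

(8) $\dfrac{\prod_{i=p-q+1}^{p} l_i}{\left(\sum_{i=p-q+1}^{p} l_i/q\right)^{q}}$.

In Section 11.7.3 it was shown that $-N$ times the logarithm of (8) has the
limiting distribution of

(9) $\dfrac{1}{2\lambda^{*2}}\left[\operatorname{tr}U_{22}^2 - \dfrac{1}{q}(\operatorname{tr}U_{22})^2\right] = \dfrac{1}{2\lambda^{*2}}\left[2\sum_{p-q+1\le i<j\le p} u_{ij}^2 + \sum_{i=p-q+1}^{p} u_{ii}^2 - \dfrac{1}{q}\left(\sum_{i=p-q+1}^{p} u_{ii}\right)^2\right]$.

The term $\sum_{i<j} u_{ij}^2$ has the limiting distribution of $(1 + \kappa)\lambda^{*2}\chi^2_{q(q-1)/2}$. The
limiting distribution of $(u_{p-q+1,p-q+1}, \ldots, u_{pp})$ is normal with mean 0 and
covariance matrix $\lambda^{*2}[2(1 + \kappa)I_q + \kappa\varepsilon\varepsilon']$, where $\varepsilon = (1, \ldots, 1)'$. The limiting distribution of
$\sum u_{ii}^2 - (\sum u_{ii})^2/q$ is that of $2(1 + \kappa)\lambda^{*2}\chi^2_{q-1}$. Hence, the limiting distribution of
(9) is the distribution of $(1 + \kappa)\chi^2_{q(q+1)/2-1}$.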

We are also interested in the characteristic roots and vectors of one 
covariance matrix in the metric of another covariance matrix. 




Theorem 13.8.2. Let $S^*$ be the sample covariance matrix of a sample of size
$M$ from (1), and let $T^*$ be the sample covariance matrix of a sample of size $N$
from (1) with $\Psi$ replaced by $\Sigma$. Let $\Lambda$ be the diagonal matrix with $\lambda_1 > \cdots > \lambda_p$
($> 0$) as the diagonal elements, where $\lambda_1, \ldots, \lambda_p$ are the roots of $|\Psi - \lambda\Sigma| = 0$.
Let $\Gamma = (\gamma_1, \ldots, \gamma_p)$ be the matrix with $\gamma_i$ the solution of $(\Psi - \lambda_i\Sigma)\gamma = 0$,
$\gamma'\Sigma\gamma = 1$, and $\gamma_{1i} \ge 0$. Let $X^* = (x_1^*, \ldots, x_p^*)$ and diagonal $L^*$ consist of the
solutions to

(10) $(S^* - lT^*)x^* = 0$,

$x^{*\prime}T^*x^* = 1$, and $x_1^* \ge 0$. As $M \to \infty$, $N \to \infty$, $M/N \to \eta$, the limiting distribution
of $Z^* = \sqrt{N}(X^* - \Gamma)$ and diagonal $D^* = \sqrt{N}(L^* - \Lambda)$ is normal with the
following covariances:

(11) ，，⑷ = (2 + 3 k)A? 宁， 

( 12 ) ^(d i ,d J ) = K\ i \ ] ^, 
(13) 质 .(: ,)=(i +k) A :( 〜 + 7 fc 7l- + 2+ ^ K y,yU 

A = | ( \ - A,) 4 

k 7=1 

(14) ,^( (d r z,) = 2 V K A, 7 ; . 

A,A (1 + rj) k 

(15) .cA (；；.；-) = - (1 + k) - ^7 ； 7 ； + T7,7；, 叫， 

(16) ^ ( d r^j) = ^\y r 

Proof. Transform S* and to S = T , S*r and T=r'T*r, and 2 

to A—r^F and / = T^T, and X* to X= = G _I . Let D = 

v/NU - AX H^y[N{G-l\ [/ =\/^(5-A), and V= 4n{T-I\ The 
matrices $U$ and $V$ and $D$ and $H$ have limiting normal distributions; they are
related by (20), (21), and (22) of Section 13.6. From there and the covariances
of the limiting distributions we derive (11) to (16). ■


13.8.2. Elliptically Contoured Matrix Distributions

Let $Y$ ($p \times N$) have the density $g(\operatorname{tr}YY')$. Then $A = YY'$ has the density
(Lemma 13.3.1)

(17) $\dfrac{\pi^{pN/2}}{\Gamma_p(\tfrac{1}{2}N)}\,|A|^{(N-p-1)/2}\,g(\operatorname{tr}A)$.




Let $A = BLB'$, where $L$ is diagonal with diagonal elements $l_1 > \cdots > l_p$ and
$B$ is orthogonal with $b_{1i} \ge 0$. Since $g(\operatorname{tr}A) = g(\sum_{i=1}^{p} l_i)$, the density of $l_1, \ldots, l_p$
is (Theorem 13.3.4)




7rVg(Er ， |/,)n, <; (/,-/；) 

i^) 


and the matrix B is independently distributed according to the conditional 
Haar invariant distribution. 

Suppose $Y^*$ ($p \times m$) and $Z^*$ ($p \times n$) have the density

$|\Psi|^{-(m+n)/2}\,g[\operatorname{tr}(Y^{*\prime}\Psi^{-1}Y^* + Z^{*\prime}\Psi^{-1}Z^*)]$ ($m, n > p$).

Let $C$ be a matrix such that $C\Psi C' = I$. Then $Y = CY^*$ and $Z = CZ^*$ have
the density $g[\operatorname{tr}(YY' + ZZ')]$. Let $A^* = Y^*Y^{*\prime}$, $B^* = Z^*Z^{*\prime}$, $A = YY'$, and
$B = ZZ'$. The roots of $|A^* - lB^*| = 0$ are the roots of $|A - lB| = 0$. Let the
roots of $|A - f(A + B)| = 0$ be $f_1 > \cdots > f_p$, and let $F = \operatorname{diag}(f_1, \ldots, f_p)$.
Define $E$ ($p \times p$) by $A + B = E'E$, and $A = E'FE$, and $e_{i1} \ge 0$, $i = 1, \ldots, p$.

(19) 


Theorem 13.8.3. The matrices E and F are independent. The density of F is 


咖 )r P Gm)r p (k) j 

the density of E is 




-I) 




2PT (-p)7T^ p(n + m ~ p) 

(20) 2 一 兑人 u( …)] 一〜咖叫. 


In the development in Section 13.2 the observations $Y$, $Z$ have the density

(21) $(2\pi)^{-p(n+m)/2}\,e^{-\frac{1}{2}\operatorname{tr}(Y'Y + Z'Z)} = (2\pi)^{-p(n+m)/2}\,e^{-\frac{1}{2}\operatorname{tr}(A + B)}$,

and in Section 13.7 $g[\operatorname{tr}(Y'Y + Z'Z)] = g[\operatorname{tr}(A + B)]$. The distribution of the
roots does not depend on the form of $g(\cdot)$; the distribution of $E$ depends
only on $E'E = A + B$. The algebra in Section 13.2 carries over to this more
general case.


PROBLEMS 

13.1. (Sec. 13.2) Prove Theorem 13.2.1 for $p = 2$ by calculating the Jacobian
directly.

13.2. (Sec. 13.2) Prove Theorem 13.3.2 for $p = 2$ directly by representing the
orthogonal matrix $C$ in terms of the cosine and sine of an angle.

13.3. (Sec. 13.2) Consider the distribution of the roots of $|A - lB| = 0$ when $A$ and
$B$ are of order two and are distributed according to $W(\Sigma, m)$ and $W(\Sigma, n)$,
respectively.

(a) Find the distribution of the larger root.

(b) Find the distribution of the smaller root.

(c) Find the distribution of the sum of the roots.

13.4. (Sec. 13.2) Prove that the Jacobian $|\partial(G, A)/\partial(E, F)|$ is $\prod_{i<j}(f_i - f_j)$ times a
function of $E$ by showing that the Jacobian vanishes for $f_i = f_j$ and that its
degree in $f_i$ is the same as that of $\prod_{i<j}(f_i - f_j)$.

13.5. (Sec. 13.3) Give the Haar invariant distribution explicitly for the $2 \times 2$ orthogonal
matrix represented in terms of the cosine and sine of an angle.

13.6. (Sec. 13.3) Let $A$ and $B$ be distributed according to $W(\Sigma, m)$ and $W(\Sigma, n)$,
respectively. Let $l_1 > \cdots > l_p$ be the roots of $|A - lB| = 0$ and $m_1 > \cdots > m_p$
be the roots of $|A - m\Sigma| = 0$. Find the distribution of the $m$'s from that of the
$l$'s by letting $n \to \infty$.

13.7. (Sec. 13.3) Prove Lemma 13.3.1 in as much detail as Theorem 13.3.1.

13.8. Let $A$ be distributed according to $W(\Sigma, n)$. In case of $p = 2$ find the distribution
of the characteristic roots of $A$. [Hint: Transform so that $\Sigma$ goes into a
diagonal matrix.]

13.9. From the result in Problem 13.6 find the distribution of the sphericity criterion
(when the null hypothesis is not true).

13.10. (Sec. 13.3) Show that $X$ ($p \times n$) has the density $f_X(X'X)$ if and only if $T$ has
the density

$\dfrac{2^p\pi^{pn/2}}{\Gamma_p(\tfrac{1}{2}n)}\prod_{i=1}^{p} t_{ii}^{\,n-i}\,f_X(TT')$,

where $T$ is the lower triangular matrix with positive diagonal elements such
that $TT' = X'X$ [Srivastava and Khatri (1979)]. [Hint: Compare Lemma 13.3.1
with Corollary 7.2.1.]

13.11. (Sec. 13.5.2) In the case that the covariance matrix is (12) find the limiting
distribution of $W_{11}$, $W_{12}$, and $W_{21}$.

13.12. (Sec. 13.3) Prove (6) of Section 12.4.



CHAPTER 14 


Factor Analysis 


14.1. INTRODUCTION 

Factor analysis is based on a model in which the observed vector is partitioned
into an unobserved systematic part and an unobserved error part. The
components of the error vector are considered as uncorrelated or independent,
while the systematic part is taken as a linear combination of a relatively
small number of unobserved factor variables. The analysis separates the
effects of the factors, which are of basic interest, from the errors. From
another point of view the analysis gives a description or explanation of the
interdependence of a set of variables in terms of the factors without regard to
the observed variability. This approach is to be compared with principal
component analysis, which describes or "explains" the variability observed.
Factor analysis was developed originally for the analysis of scores on mental
tests; however, the methods are useful in a much wider range of situations,
for example, analyzing sets of tests of attitudes, sets of physical measurements,
and sets of economic quantities. When a battery of tests is given to a
group of individuals, it is observed that the score of an individual on a given
test is more related to his scores on other tests than to the scores of other
individuals on the other tests; that is, usually the scores for any particular
individual are interrelated to some degree. This interrelation is "explained"
by considering a test score of an individual as made up of a part which is
peculiar to this particular test (called error) and a part which is a function of
more fundamental quantities called scores of primary abilities or factor scores.
Since they enter several test scores, it is their effect that connects the various
test scores. Roughly, the idea is that a person who is more intelligent in some
respects will do better on many tests than someone who is less intelligent.

The model for factor analysis is defined and discussed in Section 14.2.
Maximum likelihood estimators of the parameters are derived in the case
that the factor scores and errors are normally distributed, and a test that the
model fits is developed. The large-sample distribution theory is given for the
estimators and test criterion (Section 14.3). Maximum likelihood estimators
for fixed factors do not exist, but alternative estimation procedures are
suggested (Section 14.4). Some aspects of interpretation are treated in
Section 14.5. The maximum likelihood estimators are derived when the
factors are normal and identification is effected by specified zero loadings.
Finally the estimation of factor scores is considered. Anderson (1984a)
discusses the relationship of factor analysis to principal components and
linear functional and structural relationships.


14.2. THE MODEL 

14.2.1. Definition of the Model

Let the observable vector $X$ be written as

(1) $X = \Lambda f + U + \mu$,

where $X$, $U$, and $\mu$ are column vectors of $p$ components, $f$ is a column
vector of $m$ ($\le p$) components, and $\Lambda$ is a $p \times m$ matrix. We assume that $U$
is distributed independently of $f$ and with mean $\mathcal{E}U = 0$ and covariance
matrix $\mathcal{E}UU' = \Psi$, which is diagonal. The vector $f$ will be treated alternatively
as a random vector and as a vector of parameters that varies from
observation to observation.

In terms of mental tests each component of $X$ is a score on a test or
battery of tests. The corresponding component of $\mu$ is the average score of
this test in the population. The components of $f$ are the scores of the mental
factors; linear combinations of these enter into the test scores. The coefficients
of these linear combinations are the elements of $\Lambda$, and these are
called factor loadings. Sometimes the elements of $f$ are called common
factors because they are common to several different tests; in the first
presentation of this kind of model [Spearman (1904)] $f$ consisted of one
component and was termed the general factor. A component of $U$ is the part
of the test score not "explained" by the common factors. This is considered as
made up of the error of measurement in the test plus a specific factor, having
to do only with this particular test. Since in our model (with one set of
observations on each individual) we cannot distinguish between these two
components of the coordinate of $U$, we shall simply term the element of $U$
the error of measurement.

The specification of a given component of $X$ is similar to that in regression
theory (or analysis of variance) in that it is a linear combination of other
variables. Here, however, $f$, which plays the role of the independent variable,
is not observed.

We can distinguish between two kinds of models. In one we consider the
vector $f$ to be a random vector, and in the other we consider $f$ to be a vector
of nonrandom quantities that varies from one individual to another. In the
second case, it is more accurate to write $X_\alpha = \Lambda f_\alpha + U + \mu$. The nonrandom
factor score vector may seem a better description of the systematic part, but
it poses problems of inference because the likelihood function may not have
a maximum. In principle, the model with random factors is appropriate when
different samples consist of different individuals; the nonrandom factor
model is suitable when the specific individuals involved and not just the
structure are of interest.

When $f$ is taken as random, we assume $\mathcal{E}f = 0$. (Otherwise, $\mathcal{E}X = \Lambda\mathcal{E}f + \mu$,
and $\mu$ can be redefined to absorb $\Lambda\mathcal{E}f$.) Let $\mathcal{E}ff' = \Phi$. Our
analysis will be made in terms of first and second moments. Usually, we shall
consider $f$ and $U$ to have normal distributions. If $f$ is not random, then
$f = f_\alpha$ for the $\alpha$th individual. Then we shall assume usually $(1/N)\sum_{\alpha=1}^{N} f_\alpha = 0$
and $(1/N)\sum_{\alpha=1}^{N} f_\alpha f_\alpha' = \Phi$.

There is a fundamental indeterminacy in this model. Let $f = Cf^*$ ($f^* = C^{-1}f$)
and $\Lambda^* = \Lambda C$, where $C$ is a nonsingular $m \times m$ matrix. Then (1) can
be written as

(2) $X = \Lambda^* f^* + U + \mu$.

When $f$ is random, $\mathcal{E}f^*f^{*\prime} = C^{-1}\Phi(C^{-1})' = \Phi^*$; when $f$ is nonrandom,
$(1/N)\sum_{\alpha=1}^{N} f_\alpha^* f_\alpha^{*\prime} = \Phi^*$. The model with $\Lambda$ and $f$ is equivalent to the model
with $\Lambda^*$ and $f^*$; that is, by observing $X$ we cannot distinguish between these
two models.
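
The indeterminacy can be checked numerically. The following sketch (NumPy; the function names and the particular numbers are illustrative assumptions) applies the reparametrization $\Lambda^* = \Lambda C$, $\Phi^* = C^{-1}\Phi(C^{-1})'$ and confirms that the second-moment structure $\Lambda\Phi\Lambda' + \Psi$ of $X$ is unchanged.

```python
import numpy as np


def transform_factor_model(lam, phi, c):
    """Return (Lambda*, Phi*) for the reparametrization f = C f*."""
    c_inv = np.linalg.inv(c)
    return lam @ c, c_inv @ phi @ c_inv.T


def implied_covariance(lam, phi, psi):
    """Covariance of X implied by loadings, factor covariance, and error covariance."""
    return lam @ phi @ lam.T + psi


rng = np.random.default_rng(0)
p, m = 5, 2
lam = rng.normal(size=(p, m))
phi = np.array([[1.0, 0.3], [0.3, 1.0]])
psi = np.diag(rng.uniform(0.5, 1.5, size=p))
c = np.array([[2.0, 1.0], [0.5, 3.0]])  # any nonsingular m x m matrix

lam_star, phi_star = transform_factor_model(lam, phi, c)
sigma = implied_covariance(lam, phi, psi)
sigma_star = implied_covariance(lam_star, phi_star, psi)
```

The loadings themselves differ (`lam_star` is not `lam`), while `sigma` and `sigma_star` agree to rounding error, which is why first and second moments of $X$ alone cannot separate the two parametrizations.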

Some of the indeterminacy in the model can be eliminated by requiring
that $\mathcal{E}ff' = I$ if $f$ is random, or $\sum_{\alpha=1}^{N} f_\alpha f_\alpha' = NI$ if $f$ is not random. In this
case the factors are said to be orthogonal; if $\Phi$ is not diagonal, the factors
are said to be oblique. When we assume $\Phi = I$, then $\mathcal{E}f^*f^{*\prime} = C^{-1}(C^{-1})' = I$
($I = CC'$). The indeterminacy is equivalent to multiplication by an orthogonal
matrix; this is called the problem of rotation. Requiring that $\Phi$ be diagonal
means that the components of $f$ are independently distributed when $f$ is
assumed normal. This has an appeal to psychologists because one idea of
common mental factors is (by definition) that they are independent or
uncorrelated quantities.
A crucial assumption is that the components of $U$ are uncorrelated. Our
viewpoint is that the errors of observation and the specific factors are by
definition uncorrelated. That is, the interrelationships of the test scores are
caused by the common factors, and that is what we want to investigate. There
is another point of view on factor analysis that is fundamentally quite
different; that is, that the common factors are supposed to explain or account
for as much of the variance of the test scores as possible. To follow this point
of view, we should use a different model.

A geometric picture helps the intuition. Consider a $p$-dimensional space.
The columns of $\Lambda$ can be considered as $m$ vectors in this space. They span
some $m$-dimensional subspace; in fact, they can be considered as coordinate
axes in the $m$-dimensional space, and $f$ can be considered as coordinates of
a point in that space referred to this particular axis system. This subspace is
called the factor space. Multiplying $\Lambda$ on the right by a matrix corresponds to
taking a new set of coordinate axes in the factor space.

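If the factors are random, the covariance matrix of the observed $X$ is

(3) $\Sigma = \mathcal{E}(X - \mu)(X - \mu)' = \mathcal{E}(\Lambda f + U)(\Lambda f + U)' = \Lambda\Phi\Lambda' + \Psi$.

If the factors are orthogonal ($\mathcal{E}ff' = I$), then (3) is

(4) $\Sigma = \Lambda\Lambda' + \Psi$.

If $f$ and $U$ are normal, all the information about the structure comes from
(3) [or (4)] and $\mathcal{E}X = \mu$.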

14.2.2. Identification

Given a covariance matrix $\Sigma$ and a number $m$ of factors, we can ask whether
there exists a triplet $\Lambda$, $\Phi$ positive definite, and $\Psi$ positive definite and
diagonal to satisfy (3); if so, is the triplet unique? Since any triplet can be
transformed into an equivalent structure $\Lambda C$, $C^{-1}\Phi(C')^{-1}$, and $\Psi$, we can
put $m^2$ independent conditions on $\Lambda$ and $\Phi$ to rule out this indeterminacy.
The number of components in the observable $\Sigma$ and the number of conditions
(for uniqueness) is $\tfrac{1}{2}p(p + 1) + m^2$; the numbers of parameters in $\Lambda$,
$\Phi$, and $\Psi$ are $pm$, $\tfrac{1}{2}m(m + 1)$, and $p$, respectively. If the excess of observed
quantities and conditions over number of parameters, namely, $\tfrac{1}{2}[(p - m)^2 - p - m]$,
is positive, we can expect a problem of existence but can anticipate
uniqueness if a set of parameters does exist. If the excess is negative, we can
expect existence but possibly not uniqueness; if the excess is 0, we can hope
for both existence and uniqueness (or at least a finite number of solutions).
The question of existence of a solution is whether there exists a diagonal
matrix $\Psi$ with nonnegative diagonal entries such that $\Sigma - \Psi$ is positive
semidefinite of rank $m$. Anderson and Rubin (1956) include most of the
known results on this problem.
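
The counting argument is easy to tabulate. The sketch below (the function name is an illustrative assumption) computes the excess directly from the three counts and can be compared with the closed form $\tfrac{1}{2}[(p - m)^2 - p - m]$.

```python
def identification_excess(p, m):
    """Observed quantities plus conditions, minus the number of parameters."""
    observed = p * (p + 1) // 2                  # distinct elements of Sigma
    conditions = m * m                           # conditions removing the indeterminacy
    parameters = p * m + m * (m + 1) // 2 + p    # Lambda, Phi, and diagonal Psi
    return observed + conditions - parameters


excess_5_2 = identification_excess(5, 2)   # positive: existence is the concern
excess_6_3 = identification_excess(6, 3)   # zero: hope for existence and uniqueness
excess_4_2 = identification_excess(4, 2)   # negative: uniqueness is the concern
```

For example, five tests with two factors give an excess of 1, six tests with three factors give 0, and four tests with two factors give $-1$.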

If a solution exists and is unique, the model is said to be identified. As
noted above, some $m^2$ conditions have to be put on $\Lambda$ and $\Phi$ to eliminate a
transformation $\Lambda^* = \Lambda C$ and $\Phi^* = C^{-1}\Phi(C')^{-1}$. We have referred above to
the condition $\Phi = I$, which forces a transformation $C$ to be orthogonal.
[There are $\tfrac{1}{2}m(m + 1)$ component equations in $\Phi = I$.] For some purposes, it
is convenient to add the restrictions that

(5) $\Gamma = \Lambda'\Psi^{-1}\Lambda$

is diagonal. If the diagonal elements of $\Gamma$ are ordered and different
($\gamma_{11} > \gamma_{22} > \cdots > \gamma_{mm}$), $\Lambda$ is uniquely determined. Alternative conditions are that
the first $m$ rows of $\Lambda$ form a lower triangular matrix. A generalization of this
condition is to require that the first $m$ rows of $B\Lambda$ form a lower triangular
matrix, where $B$ is given in advance. (This condition is implied by the
so-called centroid method.)
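
To make condition (5) concrete, here is a minimal sketch, assuming orthogonal factors ($\Phi = I$), of how an arbitrary loading matrix can be rotated so that $\Lambda'\Psi^{-1}\Lambda$ becomes diagonal with decreasing diagonal elements. The function name is hypothetical; the rotation is taken from the symmetric eigendecomposition of $\Lambda'\Psi^{-1}\Lambda$, and column signs are left as returned by the eigen routine.

```python
import numpy as np


def rotate_to_diagonal_gamma(lam, psi_diag):
    """Rotate lam by an orthogonal C so that (lam C)' Psi^{-1} (lam C) is diagonal."""
    weighted = lam / psi_diag[:, None]            # Psi^{-1} Lambda
    gamma = lam.T @ weighted                      # Lambda' Psi^{-1} Lambda (m x m)
    roots, vectors = np.linalg.eigh(gamma)        # eigenvalues in ascending order
    order = np.argsort(roots)[::-1]               # reorder to descending
    rotation = vectors[:, order]
    return lam @ rotation, roots[order], rotation


rng = np.random.default_rng(1)
lam = rng.normal(size=(6, 3))
psi_diag = rng.uniform(0.5, 2.0, size=6)

lam_rot, gamma_diag, rotation = rotate_to_diagonal_gamma(lam, psi_diag)
gamma_rot = lam_rot.T @ (lam_rot / psi_diag[:, None])
```

Because the rotation is orthogonal, `lam_rot @ lam_rot.T` equals `lam @ lam.T`, so the covariance structure (4) is untouched while `gamma_rot` is diagonal.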

Simple Structure 

These are conditions proposed by Thurstone (1947, p. 335) for choosing a
matrix out of the class $\Lambda C$ that will have particular psychological meaning. If
$\lambda_{i\alpha} = 0$, then the $\alpha$th factor does not enter into the $i$th test. The general idea
of simple structure is that many tests should not depend on all the factors
when the factors have real psychological meaning. This suggests that, given a
$\Lambda$, one should consider all rotations, that is, all matrices $\Lambda C$ where $C$ is
orthogonal, and choose the one giving most 0 coefficients. This matrix can be
considered as giving the simplest structure and presumably the one with most
meaningful psychological interpretation. It should be remembered that the
psychologist can construct his or her tests so that they depend on the
assumed factors in different ways.

The positions of the 0's are not chosen in advance, but rotations $C$ are
tried until a $\Lambda$ is found satisfying these conditions. It is not clear that these
conditions effect identification. Reiersøl (1950) modified Thurstone's conditions
so that there is only one rotation that satisfies the conditions, thus
effecting identification.

Zero Elements in Specified Positions 

Here we consider a set of conditions that requires of the investigator more
a priori information. He or she must know that some particular tests do not
depend on some specific factors. In this case, the conditions are that $\lambda_{i\alpha} = 0$
for specified pairs $(i, \alpha)$; that is, that the $\alpha$th factor does not affect the $i$th
test score. Then we do not assume that $\mathcal{E}ff' = I$. These conditions are
similar to some used in econometric models. The coefficients of the $\alpha$th
column are identified except for multiplication by a scale factor if (a) there
are at least $m - 1$ zero elements in that column and if (b) the rank of $\Lambda^{(\alpha)}$ is
$m - 1$, where $\Lambda^{(\alpha)}$ is the matrix composed of the rows containing the
assigned 0's in the $\alpha$th column with those assigned 0's deleted (i.e., the $\alpha$th
column deleted). (See Problem 14.1.) The multiplication of a column by a
scale constant can be eliminated by a normalization, such as $\phi_{\alpha\alpha} = 1$ or
$\lambda_{i\alpha} = 1$ for some $i$ for each $\alpha$. If $\phi_{\alpha\alpha} = 1$, $\alpha = 1, \ldots, m$, then $\Phi$ is a
correlation matrix.

It will be seen that there are $m$ normalizations and a minimum of
$m(m - 1)$ zero conditions. This is equal to the number of elements of $C$. If
there are more than $m - 1$ zero elements specified in one or more columns
of $\Lambda$, then there may be more conditions than are required to take out the
indeterminacy in $\Lambda C$; in this case the conditions may restrict $\Lambda\Phi\Lambda'$.

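As an example, consider the model

(6) $X = \mu + \begin{pmatrix} 1 & 0 \\ \lambda_{21} & 0 \\ \lambda_{31} & \lambda_{32} \\ 0 & \lambda_{42} \\ 0 & 1 \end{pmatrix}\begin{pmatrix} v \\ a \end{pmatrix} + U = \mu + \begin{pmatrix} v \\ \lambda_{21}v \\ \lambda_{31}v + \lambda_{32}a \\ \lambda_{42}a \\ a \end{pmatrix} + U$

for the scores on five tests, where $v$ and $a$ are measures of verbal and
arithmetic ability. The first two tests are specified to depend only on verbal
ability while the last two tests depend only on arithmetic ability. The
normalizations put verbal ability into the scale of the first test and arithmetic
ability into the scale of the fifth test.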

Koopmans and Reiersøl (1950), Anderson and Rubin (1956), and Howe
(1955) suggested the use of preassigned 0's for identification and developed
maximum likelihood estimation under normality for this case. [See also
Lawley (1958).] Jöreskog (1969) called factor analysis under these identification
conditions confirmatory factor analysis; with arbitrary conditions or with
rotation to simple structure, it has been called exploratory factor analysis.
Other Conditions

A convenient set of conditions is to require the upper square submatrix of $\Lambda$
to be the identity. This assumes that the upper square matrix without this
condition is nonsingular. In fact, if $\Lambda^* = (\Lambda_1^{*\prime}, \Lambda_2^{*\prime})'$ is an arbitrary $p \times m$
matrix with $\Lambda_1^*$ square and nonsingular, then $\Lambda = \Lambda^*\Lambda_1^{*-1} = (I_m, \Lambda_2')'$
satisfies the condition. (This specification of the leading $m \times m$ submatrix of $\Lambda$
as $I_m$ is a convenient identification condition and does not imply any
substantive meaning.)
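
A short sketch of this normalization follows (NumPy; the function name and the numerical loading matrix are illustrative assumptions). It forms $\Lambda = \Lambda^*\Lambda_1^{*-1}$ by solving a linear system rather than inverting $\Lambda_1^*$ explicitly.

```python
import numpy as np


def normalize_leading_block(lam_star):
    """Return lam_star @ inv(lam1), where lam1 is the leading m x m block."""
    m = lam_star.shape[1]
    lam1 = lam_star[:m, :]
    # Solve X @ lam1 = lam_star for X via the transposed system lam1' X' = lam_star'.
    return np.linalg.solve(lam1.T, lam_star.T).T


lam_star = np.array([[2.0, 1.0],
                     [1.0, 3.0],
                     [0.5, -1.0],
                     [4.0, 0.0]])
lam = normalize_leading_block(lam_star)
```

The first two rows of `lam` form the identity, and multiplying `lam` on the right by the original leading block recovers `lam_star`, so the transformation is of the form $\Lambda^* C$ with $C = \Lambda_1^{*-1}$.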


14.2.3. Units of Measurement 

We have considered factor analysis methods applied to covariance matrices.
In many cases the unit of measurement of each component of $X$ is arbitrary.
For instance, in psychological tests the unit of scoring has no intrinsic
meaning.

Changing the units of measurement means multiplying each component of
$X$ by a constant; these constants are not necessarily equal. When a given test
score is multiplied by a constant, the factor loadings for the test are
multiplied by the same constant and the error variance is multiplied by the
square of the constant. Suppose $DX = X^*$, where $D$ is a diagonal matrix with
positive diagonal elements. Then (1) becomes

(7) $X^* = \Lambda^* f + U^* + \mu^*$,

where $\mu^* = D\mu$, $\Lambda^* = D\Lambda$, and $U^* = DU$ has covariance matrix $\Psi^* = D\Psi D$.
Then

(8) $\mathcal{E}(X^* - \mu^*)(X^* - \mu^*)' = \Lambda^*\Phi\Lambda^{*\prime} + \Psi^* = \Sigma^*$,

where $\Sigma^* = D\Sigma D$. Note that if the identification conditions are $\Phi = I$ and
$\Lambda'\Psi^{-1}\Lambda$ diagonal, then $\Lambda^*$ satisfies the latter condition. If $\Lambda$ is identified
by specified 0's and the normalization is by $\phi_{\alpha\alpha} = 1$, $\alpha = 1, \ldots, m$ (i.e., $\Phi$ is a
correlation matrix), then $\Lambda^* = D\Lambda$ is similarly identified. (If the normalization
is $\lambda_{i\alpha} = 1$ for specified $i$ for each $\alpha$, each column of $D\Lambda$ has to be
renormalized.)

A particular diagonal matrix $D$ consists of the reciprocals of the observable
standard deviations $d_{ii} = 1/\sqrt{\sigma_{ii}}$. Then $\Sigma^* = D\Sigma D$ is the correlation
matrix.
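
The effect of a change of units can be verified directly. The sketch below (NumPy, orthogonal factors assumed; the function name and the numerical values are illustrative) rescales a model by $d_{ii} = 1/\sqrt{\sigma_{ii}}$ and checks that the rescaled structure is the correlation matrix and that $\Lambda'\Psi^{-1}\Lambda$ is unchanged.

```python
import numpy as np


def rescale_factor_model(lam, psi_diag, d):
    """Apply X* = D X: loadings scale by d, error variances by d squared."""
    return d[:, None] * lam, d ** 2 * psi_diag


lam = np.array([[0.9, 0.0],
                [1.8, 0.4],
                [0.3, 2.0],
                [0.0, 1.1]])
psi_diag = np.array([0.5, 1.0, 2.0, 0.7])

sigma = lam @ lam.T + np.diag(psi_diag)
d = 1.0 / np.sqrt(np.diag(sigma))            # reciprocals of standard deviations

lam_star, psi_star = rescale_factor_model(lam, psi_diag, d)
sigma_star = lam_star @ lam_star.T + np.diag(psi_star)

gamma = lam.T @ (lam / psi_diag[:, None])
gamma_star = lam_star.T @ (lam_star / psi_star[:, None])
```

Here `sigma_star` has unit diagonal and equals $D\Sigma D$, and `gamma_star` equals `gamma`, which is the invariance of the condition "$\Lambda'\Psi^{-1}\Lambda$ diagonal" noted above.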

We shall see later that the maximum likelihood estimators with identification
conditions $\Gamma$ diagonal or specified 0's transform in the above fashion;
that is, the transformation $x_\alpha^* = Dx_\alpha$, $\alpha = 1, \ldots, N$, induces $\hat\Lambda^* = D\hat\Lambda$ and
$\hat\Psi^* = D\hat\Psi D$.

14.3. MAXIMUM LIKELIHOOD ESTIMATORS FOR RANDOM
ORTHOGONAL FACTORS 


14.3.1. Maximum Likelihood Estimators

In this section we find the maximum likelihood estimators of the parameters
when the observations are normally distributed, that is, the factor scores and
errors are normal [Lawley (1940)]. Then $\Sigma = \Lambda\Phi\Lambda' + \Psi$. We impose conditions
on $\Lambda$ and $\Phi$ to make them just identified. These do not restrict
$\Lambda\Phi\Lambda'$; it is a positive semidefinite matrix of rank $m$. For convenience we
suppose that $\Phi = I$ (i.e., the factors are orthogonal or uncorrelated) and that
$\Gamma = \Lambda'\Psi^{-1}\Lambda$ is diagonal. Then the likelihood depends on the mean $\mu$ and
$\Sigma = \Lambda\Lambda' + \Psi$. The maximum likelihood estimators of $\Lambda$ and $\Phi$ under some
other conditions effecting just identification [e.g., $\Lambda = (I_m, \Lambda_2')'$] are
transformations of the maximum likelihood estimators of $\Lambda$ under the preceding
conditions. If $x_1, \ldots, x_N$ are a set of $N$ observations on $X$, the likelihood
function for this sample is

(1) $L = (2\pi)^{-pN/2}\,|\Sigma|^{-N/2}\exp\left[-\tfrac{1}{2}\sum_{\alpha=1}^{N}(x_\alpha - \mu)'\Sigma^{-1}(x_\alpha - \mu)\right]$.

The maximum likelihood estimator of the mean $\mu$ is $\hat\mu = \bar{x} = (1/N)\sum_{\alpha=1}^{N} x_\alpha$.
Let

(2) $A = \sum_{\alpha=1}^{N}(x_\alpha - \bar{x})(x_\alpha - \bar{x})'$.

Next we shall maximize the logarithm of (1) with $\mu$ replaced by $\hat\mu$; this is†

(3) $-\tfrac{1}{2}pN\log 2\pi - \tfrac{1}{2}N\log|\Sigma| - \tfrac{1}{2}\operatorname{tr}A\Sigma^{-1}$.

(This is the logarithm of the concentrated likelihood.) From $\Sigma\Sigma^{-1} = I$, we
obtain for any parameter $\theta$

(4) $\dfrac{\partial\Sigma^{-1}}{\partial\theta} = -\Sigma^{-1}\dfrac{\partial\Sigma}{\partial\theta}\Sigma^{-1}$.

Then the partial derivative of (3) with regard to $\psi_{ii}$, a diagonal element of
$\Psi$, is $-N/2$ times

(5) $\sigma^{ii} - \sum_{k,j=1}^{p} c_{kj}\,\sigma^{ji}\sigma^{ik}$,

† We could add the restriction that the off-diagonal elements of $\Psi$ are 0 with Lagrange
multipliers, but then the Lagrange multipliers become 0 when the derivatives are set equal to 0.
Such restrictions do not affect the maximum.
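where $\Sigma^{-1} = (\sigma^{ij})$ and $(c_{ij}) = C = (1/N)A$. In matrix notation, (5) set equal
to 0 yields

(6) $\operatorname{diag}\Sigma^{-1} = \operatorname{diag}\Sigma^{-1}C\Sigma^{-1}$,

where $\operatorname{diag}H$ indicates the diagonal terms of the matrix $H$. Equivalently
$\operatorname{diag}\Sigma^{-1}(\Sigma - C)\Sigma^{-1} = \operatorname{diag}0$. The derivative of (3) with respect to $\lambda_{k\tau}$ is $-N$
times

(7) $\sum_{j=1}^{p}\sigma^{kj}\lambda_{j\tau} - \sum_{h,g,j=1}^{p}\sigma^{kh}c_{hg}\sigma^{gj}\lambda_{j\tau}$, $k = 1, \ldots, p$, $\tau = 1, \ldots, m$.

In matrix notation (7) set equal to 0 yields

(8) $\Sigma^{-1}\Lambda = \Sigma^{-1}C\Sigma^{-1}\Lambda$.

We have

(9) $\Sigma\Psi^{-1}\Lambda = (\Lambda\Lambda' + \Psi)\Psi^{-1}\Lambda = \Lambda\Gamma + \Lambda = \Lambda(\Gamma + I)$.

From this we obtain $\Psi^{-1}\Lambda(\Gamma + I)^{-1} = \Sigma^{-1}\Lambda$. Multiply (8) by $\Sigma$ and use the
above to obtain

(10) $\Lambda(\Gamma + I) = C\Psi^{-1}\Lambda$,

or

(11) $(C - \Psi)\Psi^{-1}\Lambda = \Lambda\Gamma$.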

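Next we want to show that $\Sigma^{-1} - \Sigma^{-1}C\Sigma^{-1} = \Sigma^{-1}(\Sigma - C)\Sigma^{-1}$ is
$\Psi^{-1}(\Sigma - C)\Psi^{-1}$ when (8) holds. Multiply the latter by $\Sigma$ on the left and on
the right to obtain

(12) $\Sigma\Psi^{-1}(\Sigma - C)\Psi^{-1}\Sigma = (\Lambda\Lambda' + \Psi)\Psi^{-1}(\Psi + \Lambda\Lambda' - C)\Psi^{-1}(\Lambda\Lambda' + \Psi) = \Psi + \Lambda\Lambda' - C$

because

(13) $\Lambda\Lambda'\Psi^{-1}(\Psi + \Lambda\Lambda' - C) = \Lambda\Lambda' + \Lambda\Gamma\Lambda' - \Lambda\Lambda'\Psi^{-1}C = \Lambda[(I + \Gamma)\Lambda' - \Lambda'\Psi^{-1}C] = 0$

by virtue of (10). Thus

(14) $\Sigma^{-1}(\Sigma - C)\Sigma^{-1} = \Psi^{-1}(\Sigma - C)\Psi^{-1}$.

Then (6) is equivalent to $\operatorname{diag}\Psi^{-1}(\Sigma - C)\Psi^{-1} = \operatorname{diag}0$. Since $\Psi$ is diagonal,
this equation is equivalent to

(15) $\operatorname{diag}(\Lambda\Lambda' + \Psi) = \operatorname{diag}C$.

The estimators $\hat\Lambda$ and $\hat\Psi$ are determined by (10), (15), and the requirement
that $\hat\Lambda'\hat\Psi^{-1}\hat\Lambda$ is diagonal.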

We can multiply (11) on the left by ^ to obtain 

( 16 ) ~^(C - »A) = (^P _ ^A)r, 

which shows that the columns of 市 _ t/V are characteristic vectors of 
kC—^ — / and the corresponding diagonal ele¬ 
ments of T are the characteristic roots. [In fact, the characteristic vectors of 
市 - t 伞 — 一 J are the characteristic vectors of 市 — ^ because 
( 市 — ^ -l)x- yx is equivalent to -C^ - = (1 4 - y)xj The vec¬ 
tors are normalized by ( 市 —& 八） ' （市 — ^A) — A = T. The characteristic 

roots are chosen to maximize the likelihood. To evaluate the maximized 
likelihood function we calculate 


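(17)  $\begin{aligned} \operatorname{tr} C\hat\Sigma^{-1} &= \operatorname{tr} C\hat\Sigma^{-1}(\hat\Sigma - \hat\Lambda\hat\Lambda')\hat\Psi^{-1} \\ &= \operatorname{tr}[C\hat\Psi^{-1} - (C\hat\Sigma^{-1}\hat\Lambda)\hat\Lambda'\hat\Psi^{-1}] \\ &= \operatorname{tr}[C\hat\Psi^{-1} - \hat\Lambda\hat\Lambda'\hat\Psi^{-1}] \\ &= \operatorname{tr}[(\hat\Lambda\hat\Lambda' + \hat\Psi)\hat\Psi^{-1} - \hat\Lambda\hat\Lambda'\hat\Psi^{-1}] \\ &= p. \end{aligned}$

The third equality follows from (8) multiplied on the left by $\hat\Sigma$; the fourth
equality follows from (15) and the fact that $\hat\Psi$ is diagonal. Next we find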


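(18)  $\begin{aligned} |\hat\Sigma| &= |\hat\Psi| \cdot |\hat\Psi^{-1/2}\hat\Lambda\hat\Lambda'\hat\Psi^{-1/2} + I_p| \\ &= |\hat\Psi| \cdot |\hat\Gamma + I_m| \\ &= \prod_{i=1}^{p}\hat\psi_{ii} \prod_{j=1}^{m}(\hat\gamma_j + 1). \end{aligned}$

The second equality is $|UU' + I_p| = |U'U + I_m|$ for $U$ $p \times m$, which is proved
as in (14) of Section 8.4. From the fact that the characteristic roots of
$\hat\Psi^{-1/2}(C - \hat\Psi)\hat\Psi^{-1/2}$ are the roots $\gamma_1 > \gamma_2 > \cdots > \gamma_p$ of $0 = |C - \hat\Psi - \gamma\hat\Psi| = |C - (1 + \gamma)\hat\Psi|$,

(19)  $\frac{|C|}{|\hat\Psi|} = \prod_{j=1}^{p}(1 + \gamma_j).$

[Note that the roots $1 + \gamma_j$ of $\hat\Psi^{-1/2}C\hat\Psi^{-1/2}$ are positive. The roots of
$\hat\Psi^{-1/2}(C - \hat\Psi)\hat\Psi^{-1/2}$ are not necessarily positive; usually some will be negative.]
Then

(20)  $|\hat\Sigma| = \prod_{i=1}^{p}\hat\psi_{ii} \prod_{j \in S}(1 + \gamma_j) = |C| \, \frac{\prod_{j \in S}(1 + \gamma_j)}{\prod_{j=1}^{p}(1 + \gamma_j)} = \frac{|C|}{\prod_{j \notin S}(1 + \gamma_j)},$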


where $S$ is the set of indices corresponding to the roots in $\hat\Gamma$. The logarithm
of the maximized likelihood function is

(21)  $-\tfrac12 pN \log 2\pi - \tfrac12 N \log|C| + \tfrac12 N \sum_{j \notin S} \log(1 + \gamma_j) - \tfrac12 Np.$

The largest roots $\gamma_1 > \cdots > \gamma_m$ should be selected for diagonal elements of
$\hat\Gamma$. Then $S = \{1, \ldots, m\}$. The logarithm of the concentrated likelihood (3) is a
function of $\Sigma = \Lambda\Lambda' + \Psi$. This matrix is positive definite for every $\Lambda$ and
every diagonal $\Psi$ that is positive definite; it is also positive definite for some
diagonal $\Psi$'s that are not positive definite. Hence there is not necessarily a
relative maximum for $\Psi$ positive definite. The concentrated likelihood
function may increase as one or more diagonal elements of $\Psi$ approaches 0.
In that case the derivative equations may not be satisfied for $\Psi$ positive
definite.
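
The eigenvalue form of the likelihood equations can be illustrated numerically. The following is a minimal sketch, assuming NumPy and assuming that a trial diagonal $\Psi$ is already given; the function names `loadings_given_psi` and `log_likelihood` are hypothetical and are introduced only for illustration. For fixed $\Psi$ the loadings are built from the $m$ largest characteristic roots of $\Psi^{-1/2}(C - \Psi)\Psi^{-1/2}$, with columns scaled so that $\Lambda'\Psi^{-1}\Lambda = \Gamma$.

```python
import numpy as np

def loadings_given_psi(C, psi, m):
    """Return Lambda solving (16) for a fixed diagonal Psi (given as a vector),
    together with all characteristic roots in decreasing order."""
    s = 1.0 / np.sqrt(psi)
    # Psi^{-1/2} (C - Psi) Psi^{-1/2}
    M = s[:, None] * (C - np.diag(psi)) * s[None, :]
    gam, vec = np.linalg.eigh(M)            # roots in ascending order
    gam, vec = gam[::-1], vec[:, ::-1]      # reorder so gamma_1 >= ... >= gamma_p
    top = np.clip(gam[:m], 0.0, None)       # guard: a negative root gives no real loading
    # columns of Psi^{-1/2} Lambda are eigenvectors with squared length gamma_j
    Lam = np.sqrt(psi)[:, None] * vec[:, :m] * np.sqrt(top)[None, :]
    return Lam, gam

def log_likelihood(C, Lam, psi, N):
    """Concentrated log likelihood (3) evaluated at Sigma = Psi + Lambda Lambda'."""
    p = C.shape[0]
    Sigma = Lam @ Lam.T + np.diag(psi)
    logdet = np.linalg.slogdet(Sigma)[1]
    return -0.5 * N * (p * np.log(2 * np.pi) + logdet
                       + np.trace(np.linalg.solve(Sigma, C)))
```

In an iterative procedure one would alternate this step with an update of $\Psi$; that outer loop is deliberately omitted here because, as noted above, it may drift toward the boundary of the admissible region.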

The equations for the estimators (11) and (15) can be written as polynomial
equations [multiplying (11) by $|\Psi|$], but cannot be solved directly. There
are various iterative procedures for finding a maximum of the likelihood
function, including steepest descent, Newton-Raphson, scoring (using the
information matrix), and Fletcher-Powell. [See Lawley and Maxwell (1971),
Appendix II, for a discussion.]

Since there may not be a relative maximum in the region for which $\psi_{ii} > 0$,
$i = 1, \ldots, p$, an iterative procedure may define a sequence of values of $\hat\Lambda$ and
$\hat\Psi$ that includes $\hat\psi_{ii} < 0$ for some indices $i$. Such negative values are
inadmissible because $\psi_{ii}$ is interpreted as the variance of an error. One may impose
the condition that $\psi_{ii} \ge 0$, $i = 1, \ldots, p$. Then the maximum may occur on the
boundary (and not all of the derivative equations will be satisfied). For some
indices $i$ the estimated variance of the error is 0; that is, some test scores are
exactly linear combinations of factor scores. If the identification conditions
$\Phi = I$ and $\Lambda'\Psi^{-1}\Lambda$ diagonal are dropped, we can find a coordinate system
for the factors such that the test scores with 0 error variance can be
interpreted as (transformed) factor scores. That interpretation does not seem
useful. [See Lawley and Maxwell (1971) for further discussion.]

An alternative to requiring $\psi_{ii}$ to be positive is to require $\psi_{ii}$ to be
bounded away from 0. A possibility is $\psi_{ii} \ge \varepsilon\sigma_{ii}$ for some small $\varepsilon$, such as
0.005. Of course, the value of $\varepsilon$ is arbitrary; increasing $\varepsilon$ will decrease the
value of the maximum if the maximum is not in the interior of the restricted
region, and the derivative equations will not all be satisfied.

The nature of the concentrated likelihood is such that more than one 
relative maximum may be possible. Which maximum an iterative procedure 
approaches will depend on the initial values. Rubin and Thayer (1982) have
given an example of three sets of estimates from three different initial 
estimates using the EM algorithm. 

The EM (expectation-maximization) algorithm is a possible computational
device for maximum likelihood estimation [Dempster, Laird, and Rubin
(1977), Rubin and Thayer (1982)]. The idea is to treat the unobservable $f$'s as
missing data. Under the assumption that $f$ and $U$ have a joint normal
distribution, the sufficient statistics are the means and covariances of the $X$'s
and $f$'s. The E-step of the algorithm is to obtain the expectation of the
covariances on the basis of trial values of the parameters. The M-step is to 
maximize the likelihood function on the basis of these covariances; this step 
provides updated values of the parameters. The steps alternate, and the 
procedure usually converges to the maximum likelihood estimators. (See 
Problem 14.3.) 
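
The alternation of E-steps and M-steps can be sketched in code. This is a minimal illustration, assuming NumPy, assuming $\Phi = I$, and using the conditional expectations and updates stated in Problem 14.3; the function name `em_factor_analysis`, the starting values, and the stopping rule are arbitrary choices made here, not part of the cited papers' presentation.

```python
import numpy as np

def em_factor_analysis(C, m, Lam0=None, psi0=None, n_iter=1000, tol=1e-9):
    """EM iterations for Sigma = Psi + Lambda Lambda' with Phi = I.
    Returns loadings, error variances, and the sequence of values of
    -log|Sigma| - tr(C Sigma^{-1}) visited (a monitor of the likelihood)."""
    if Lam0 is None:
        w, v = np.linalg.eigh(C)                    # ascending roots
        Lam0 = v[:, -m:] * np.sqrt(0.5 * w[-m:])    # arbitrary starting loadings
    if psi0 is None:
        psi0 = 0.5 * np.diag(C)                     # arbitrary starting error variances
    Lam, psi = np.array(Lam0, float), np.array(psi0, float)
    history = []
    for _ in range(n_iter):
        Sigma = Lam @ Lam.T + np.diag(psi)
        history.append(-np.linalg.slogdet(Sigma)[1]
                       - np.trace(np.linalg.solve(Sigma, C)))
        # E-step: expected covariances of (x, f) given the data and trial values
        delta = np.linalg.solve(Sigma, Lam)         # Sigma^{-1} Lambda
        Cxf = C @ delta
        Cff = delta.T @ C @ delta + np.eye(m) - Lam.T @ delta
        # M-step: regression of x on f using the expected covariances
        Lam_new = np.linalg.solve(Cff, Cxf.T).T     # Cxf Cff^{-1} (Cff is symmetric)
        psi_new = np.diag(C - Lam_new @ Cxf.T).copy()
        change = max(np.abs(Lam_new - Lam).max(), np.abs(psi_new - psi).max())
        Lam, psi = Lam_new, psi_new
        if change < tol:
            break
    return Lam, psi, history
```

The returned loadings are determined only up to rotation, since this sketch does not impose the condition that $\Lambda'\Psi^{-1}\Lambda$ be diagonal. Different starting values may lead to different relative maxima.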

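As noted in Section 14.2, the structure is equivariant and the factor scores
are invariant under changes in the units of measurement of the observed
variables $X \to DX$, where $D$ is a diagonal matrix with positive diagonal
elements and $\Lambda$ is identified by the condition that $\Lambda'\Psi^{-1}\Lambda$ is diagonal. If we let $D\Lambda = \Lambda^*$,
$D\Psi D = \Psi^*$, and $DCD = C^*$, then the logarithm of the likelihood function is
a constant plus a constant times

(22)  $\begin{aligned} &-\log|\Psi^* + \Lambda^*\Lambda^{*\prime}| - \operatorname{tr} C^*(\Psi^* + \Lambda^*\Lambda^{*\prime})^{-1} \\ &\qquad = -\log|\Psi + \Lambda\Lambda'| - \operatorname{tr} C(\Psi + \Lambda\Lambda')^{-1} - 2\log|D|. \end{aligned}$

The maximum likelihood estimators of $\Lambda^*$ and $\Psi^*$ are $\hat\Lambda^* = D\hat\Lambda$ and
$\hat\Psi^* = D\hat\Psi D$, and $\hat\Lambda^{*\prime}\hat\Psi^{*-1}\hat\Lambda^* = \hat\Lambda'\hat\Psi^{-1}\hat\Lambda$ is diagonal. That is, the estimated
factor loadings and error variances are merely changed by the units of
measurement.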

It is often convenient to use $d_{ii} = 1/\sqrt{c_{ii}}$ so $DCD = (r_{ij})$ is made up of
the sample correlation coefficients. The analysis is independent of the units
of measurement. This fact is related to the fact that psychological test scores
do not have natural units.

The fact that the factors do not depend on the location and scale factors is
one reason for considering factor analysis as an analysis of interdependence.

It is convenient to give some rules of thumb for initial estimates of the
communalities, $\sum_{j=1}^{m} \hat\lambda_{ij}^2 = 1 - \hat\psi_{ii}$, in terms of observed correlations. One rule
is to use the $R^2_{i \cdot 1, \ldots, i-1, i+1, \ldots, p}$. Another is to use $\max_{h \ne i} |r_{ih}|$.
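
Two widely used rules of thumb for initial communalities are the squared multiple correlation of each variable with the others and the largest absolute correlation in each row. A minimal sketch of both, assuming NumPy and a nonsingular sample correlation matrix; the function name `initial_communalities` and its `rule` argument are hypothetical. The squared multiple correlation is computed from the identity $R_i^2 = 1 - 1/r^{ii}$, where $r^{ii}$ is the $i$th diagonal element of the inverse correlation matrix.

```python
import numpy as np

def initial_communalities(R, rule="smc"):
    """Rule-of-thumb starting communalities from a correlation matrix R."""
    R = np.asarray(R, float)
    if rule == "smc":
        # squared multiple correlation of variable i on all the others
        return 1.0 - 1.0 / np.diag(np.linalg.inv(R))
    # otherwise: largest absolute off-diagonal correlation in each row
    off = np.abs(R - np.diag(np.diag(R)))
    return off.max(axis=1)
```

The corresponding initial error variances in correlation units are one minus these values.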


14.3.2. Test of the Hypothesis That the Model Fits

We shall derive the likelihood ratio test that the model fits; that is, that for a
specified $m$ the covariance matrix can be written as $\Sigma = \Psi + \Lambda\Lambda'$ for some
diagonal positive definite $\Psi$ and some $p \times m$ matrix $\Lambda$. The likelihood ratio
criterion is

(23)  $\frac{\max_{\mu,\Lambda,\Psi} L(\mu, \Psi + \Lambda\Lambda')}{\max_{\mu,\Sigma} L(\mu, \Sigma)} = \frac{|C|^{N/2}}{|\hat\Psi + \hat\Lambda\hat\Lambda'|^{N/2}} = \prod_{j=m+1}^{p} (1 + \gamma_j)^{N/2}$

because the unrestricted maximum likelihood estimator of $\Sigma$ is $C$,
$\operatorname{tr} C(\hat\Psi + \hat\Lambda\hat\Lambda')^{-1} = p$ by (17), and $|C|/|\hat\Sigma| = \prod_{j=m+1}^{p}(1 + \gamma_j)$ from (20). The null
hypothesis is rejected if (23) is too small. We can use $-2$ times the logarithm
of the likelihood ratio criterion:

(24)  $-N \sum_{j=m+1}^{p} \log(1 + \gamma_j)$

and reject the null hypothesis if (24) is too large.

If the regularity conditions for $\hat\Psi$ and $\hat\Lambda$ to be asymptotically normally
distributed hold, the limiting distribution of (24) under the null hypothesis is
$\chi^2$ with degrees of freedom $\tfrac12[(p - m)^2 - p - m]$, which is the number of
elements of $\Sigma$ plus the number of identifying restrictions minus the number
of parameters in $\Psi$ and $\Lambda$. Bartlett (1950) suggested replacing $N$ by†
$N - (2p + 11)/6 - 2m/3$. See also Amemiya and Anderson (1990).
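
The test statistic and its degrees of freedom are straightforward to compute once the error variances have been estimated. A minimal sketch, assuming NumPy and assuming that `psi_hat` holds the maximum likelihood estimates (the statistic has its chi-squared interpretation only in that case); the function name `fit_test_statistic` and the `bartlett` flag are hypothetical.

```python
import numpy as np

def fit_test_statistic(C, psi_hat, m, N, bartlett=True):
    """-2 log likelihood ratio for the m-factor model, with degrees of freedom.
    With bartlett=True, N is replaced by N - (2p + 11)/6 - 2m/3."""
    p = C.shape[0]
    s = 1.0 / np.sqrt(psi_hat)
    # roots of Psi^{-1/2} (C - Psi) Psi^{-1/2}, largest first
    gam = np.linalg.eigvalsh(s[:, None] * (C - np.diag(psi_hat)) * s[None, :])[::-1]
    factor = N - (2 * p + 11) / 6 - 2 * m / 3 if bartlett else N
    stat = -factor * np.sum(np.log1p(gam[m:]))     # uses the p - m smallest roots
    df = ((p - m) ** 2 - p - m) // 2               # the numerator is always even
    return stat, df
```

The statistic would then be referred to a chi-squared table with `df` degrees of freedom, provided `df` is positive.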

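From (15) and the fact that $\gamma_1 > \cdots > \gamma_p$ are the characteristic roots of
$\hat\Psi^{-1/2}(C - \hat\Psi)\hat\Psi^{-1/2}$ we have

(25)  $\begin{aligned} 0 &= \operatorname{tr} \hat\Psi^{-1/2}(C - \hat\Psi - \hat\Lambda\hat\Lambda')\hat\Psi^{-1/2} \\ &= \operatorname{tr} \hat\Psi^{-1/2}(C - \hat\Psi)\hat\Psi^{-1/2} - \operatorname{tr} \hat\Psi^{-1/2}\hat\Lambda\hat\Lambda'\hat\Psi^{-1/2} \\ &= \operatorname{tr} \hat\Psi^{-1/2}(C - \hat\Psi)\hat\Psi^{-1/2} - \operatorname{tr} \hat\Gamma \\ &= \sum_{i=1}^{p}\gamma_i - \sum_{i=1}^{m}\gamma_i = \sum_{i=m+1}^{p}\gamma_i. \end{aligned}$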


† This factor is heuristic. If $m = 0$, the factor from Chapter 9 is $N - (2p + 11)/6$; Bartlett
suggested replacing $N$ and $p$ by $N - m$ and $p - m$, respectively.

If $|\gamma_j| < 1$ for $j = m + 1, \ldots, p$, we can expand (24) using (25) as

(26)  $-N \sum_{j=m+1}^{p} \left(\gamma_j - \tfrac12\gamma_j^2 + \tfrac13\gamma_j^3 - \cdots\right) = \tfrac12 N \sum_{j=m+1}^{p} \left(\gamma_j^2 - \tfrac23\gamma_j^3 + \cdots\right).$

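The criterion is approximately $\tfrac12 N \sum_{j=m+1}^{p} \gamma_j^2$. The estimators $\hat\Psi$ and $\hat\Lambda$ are
found so that $C - \hat\Psi - \hat\Lambda\hat\Lambda'$ is small in a statistical sense or, equivalently,
so $C - \hat\Psi$ is approximately of rank $m$. Then the smallest $p - m$ roots of
$\hat\Psi^{-1/2}(C - \hat\Psi)\hat\Psi^{-1/2}$ should be near 0. The criterion measures the deviations
of these roots from 0. Since $\gamma_{m+1}, \ldots, \gamma_p$ are the nonzero roots of
$\hat\Psi^{-1/2}(C - \hat\Sigma)\hat\Psi^{-1/2}$, we see that

(27)  $\begin{aligned} \tfrac12 \sum_{j=m+1}^{p} \gamma_j^2 &= \tfrac12 \operatorname{tr}\left[\hat\Psi^{-1/2}(C - \hat\Sigma)\hat\Psi^{-1/2}\right]^2 \\ &= \tfrac12 \operatorname{tr} \hat\Psi^{-1}(C - \hat\Sigma)\hat\Psi^{-1}(C - \hat\Sigma) \\ &= \sum_{i<j} \frac{(c_{ij} - \hat\sigma_{ij})^2}{\hat\psi_{ii}\hat\psi_{jj}} \end{aligned}$

because the diagonal elements of $C - \hat\Sigma$ are 0.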

In many situations the investigator does not know a value of m to 
hypothesize. He or she wants to determine the smallest number of factors 
such that the model is consistent with the data. It is customary to test 
successive values of m. The investigator starts with a test that the number of 
factors is a specified $m_0$ (possibly 0 or 1). If that hypothesis is rejected, one
proceeds to test that the number is $m_0 + 1$. One continues in that fashion
until a hypothesis is accepted or until $\tfrac12[(p - m)^2 - p - m] < 0$. In the last
event one concludes that no nontrivial factor model fits. Unfortunately,
the probabilities of errors under this procedure are unknown, even
asymptotically.


14.3.3. Asymptotic Distributions of the Estimators

The maximum likelihood estimators $\hat\Lambda$ and $\hat\Psi$ maximize the average concentrated
log likelihood function $L^*(C, \Lambda^*, \Psi^*)$ given by (3) divided by $N$ for
$\Sigma^* = \Psi^* + \Lambda^*\Lambda^{*\prime}$, subject to $\Lambda^{*\prime}\Psi^{*-1}\Lambda^*$ being diagonal. If $C$ is a consistent
estimator of $\Sigma$ (the “true” covariance matrix), then $L^*(C, \Lambda^*, \Psi^*) \to
L^*(\Psi + \Lambda\Lambda', \Lambda^*, \Psi^*)$ uniformly in probability in a neighborhood of $\Lambda$, $\Psi$,
and $L^*(\Psi + \Lambda\Lambda', \Lambda^*, \Psi^*)$ has a unique maximum at $\Psi^* = \Psi$ and $\Lambda^* = \Lambda$.
Because the function is continuous, the $\Lambda^*$, $\Psi^*$ that maximize
$L^*(C, \Lambda^*, \Psi^*)$ must converge stochastically to $\Lambda$, $\Psi$.

Theorem 14.3.1. If $\Lambda$ and $\Psi$ are identified by $\Lambda'\Psi^{-1}\Lambda$ being diagonal, if
the diagonal elements are different and ordered, and if $C \xrightarrow{p} \Psi + \Lambda\Lambda'$, then
$\hat\Psi \xrightarrow{p} \Psi$ and $\hat\Lambda \xrightarrow{p} \Lambda$.

A sufficient condition for $C \xrightarrow{p} \Sigma$ is that $(f', U')'$ has a distribution with
finite second-order moments.

The estimators $\hat\Lambda$ and $\hat\Psi$ are the solutions to the equations (10), (15), and
the requirement that $\Lambda'\Psi^{-1}\Lambda$ is diagonal. These equations are polynomial
equations. The derivatives of $\hat\Lambda$ and $\hat\Psi$ as functions of $C$ are continuous
unless they become infinite. Anderson and Rubin (1956) investigated conditions
for the derivative to be finite and proved the following theorem:

Theorem 14.3.2. Let

(28)  $(\theta_{ij}) = \Theta = \Psi - \Lambda(\Lambda'\Psi^{-1}\Lambda)^{-1}\Lambda'.$

If $(\theta_{ij}^2)$ is nonsingular, if $\Lambda$ and $\Psi$ are identified by the condition that $\Lambda'\Psi^{-1}\Lambda$
is diagonal and the diagonal elements are different and ordered, if $C \xrightarrow{p} \Psi + \Lambda\Lambda'$,
and if $\sqrt{N}(C - \Sigma)$ has a limiting normal distribution, then $\sqrt{N}(\hat\Lambda - \Lambda)$ and
$\sqrt{N}(\hat\Psi - \Psi)$ have a limiting normal distribution.

For example, $\sqrt{N}(C - \Sigma)$ will have a limiting distribution if $(f', U')'$ has a
distribution with finite fourth moments.

The covariance matrix of the limiting distribution of $\sqrt{N}(\hat\Lambda - \Lambda)$ and
$\sqrt{N}(\hat\Psi - \Psi)$ is too complicated to derive or even present here. Lawley (1953)
found covariances for $\sqrt{N}(\hat\Lambda - \Lambda)$ appropriate for $\Psi$ known, and Lawley
(1967) extended his work to the case of $\Psi$ estimated. [See also Lawley and
Maxwell (1971).] Jennrich and Thayer (1973) corrected an error in his work.

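The covariance of $\sqrt{N}(\hat\psi_{ii} - \psi_{ii})$ and $\sqrt{N}(\hat\psi_{jj} - \psi_{jj})$ in the limiting
distribution is

(29)  $2\psi_{ii}^2\psi_{jj}^2\xi^{ij}, \qquad i, j = 1, \ldots, p,$

where $(\xi^{ij}) = (\theta_{ij}^2)^{-1}$. The other covariances are too involved to give here.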

While the asymptotic covariances are too complicated to give insight into 
the sampling variability, they can be programmed for computation. In that 
case the parameters are replaced by their consistent estimators. 

14.3.4. Minimum-Distance Methods

An alternative to maximum likelihood is generalized least squares. The
estimators are the values of $\Psi$ and $\Lambda$ that minimize

(30)  $\tfrac12 \operatorname{tr}(C - \Sigma)H(C - \Sigma)H,$

where $\Sigma = \Psi + \Lambda\Lambda'$ and $H = \Sigma^{-1}$ or some consistent estimator of $\Sigma^{-1}$.
When $H = \Sigma^{-1}$, the objective function is of the form

(31)  $[c - \sigma(\Psi, \Lambda)]'[\operatorname{cov} c]^{-1}[c - \sigma(\Psi, \Lambda)],$

where $c$ represents the elements of $C$ arranged in a vector, $\sigma(\Psi, \Lambda)$ is
$\Psi + \Lambda\Lambda'$ arranged in a corresponding vector, and $\operatorname{cov} c$ is the covariance
matrix of $c$ under normality [Anderson (1973a)]. Joreskog and Goldberger
(1972) use $C^{-1}$ for $H$ and minimize

(32)  $\tfrac12 \operatorname{tr}(C - \Sigma)C^{-1}(C - \Sigma)C^{-1} = \tfrac12 \operatorname{tr}(I - \Sigma C^{-1})^2.$

The matrix of derivatives with respect to the elements of $\Lambda$ set equal to 0
forms the matrix equation

(33)  $C^{-1}(C - \Sigma)C^{-1}\Lambda = 0.$

This can be rewritten as

(34)  $\Lambda = \Sigma C^{-1}\Lambda.$

Multiplication on the left by $\Sigma^{-1}C\Sigma^{-1}$ yields (8), which leads to (10). This
estimator of $\Lambda$ given $\Psi$ is the same as the maximum likelihood estimator
except for normalization of columns. The equation obtained by setting the
derivatives of (32) with respect to $\Psi$ equal to 0 is

(35)  $\operatorname{diag} C^{-1}[(\Psi + \Lambda\Lambda') - C]C^{-1} = \operatorname{diag} 0.$

An alternative is to minimize

(36)  $\tfrac12 \operatorname{tr}\{(\Psi + \Lambda\Lambda')^{-1}[C - (\Psi + \Lambda\Lambda')]\}^2.$

This leads to (8) or (10) and

(37)  $\operatorname{diag} \Sigma^{-1}C\Sigma^{-1}(C - \Sigma)\Sigma^{-1} = \operatorname{diag} 0.$

Browne (1974) showed that the generalized least squares estimator of $\Psi$ has
the same asymptotic distribution as the maximum likelihood estimator. Dahm
and Fuller (1981) showed that if $\operatorname{cov} c$ in (31) is replaced by a matrix
converging to $\operatorname{cov} c$ and $\Psi$, $\Lambda$, and $\Phi$ depend on some parameters, then the
asymptotic distributions are the same as for maximum likelihood.
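
A generalized least squares criterion of this kind is easy to evaluate numerically. The sketch below, assuming NumPy, computes one half of $\operatorname{tr}[(C - \Sigma)H]^2$ for a fixed symmetric weight matrix $H$ (defaulting to $C^{-1}$, the Joreskog-Goldberger choice) together with its gradient with respect to $\Lambda$, which is $-2H(C - \Sigma)H\Lambda$; setting that gradient to zero with $H = C^{-1}$ reproduces the matrix equation for the loadings. The function names are hypothetical, and the factor one half is a convention that does not affect the minimizer.

```python
import numpy as np

def gls_objective(C, Lam, psi, H=None):
    """One half of tr[(C - Sigma) H (C - Sigma) H], Sigma = Psi + Lambda Lambda'."""
    Sigma = Lam @ Lam.T + np.diag(psi)
    H = np.linalg.inv(C) if H is None else H
    D = (C - Sigma) @ H
    return 0.5 * np.trace(D @ D)

def gls_gradient_lambda(C, Lam, psi, H=None):
    """Gradient of gls_objective with respect to Lambda, treating H as fixed."""
    Sigma = Lam @ Lam.T + np.diag(psi)
    H = np.linalg.inv(C) if H is None else H
    return -2.0 * H @ (C - Sigma) @ H @ Lam
```

These two functions could be handed to a general-purpose optimizer; when $H$ is taken to be $\Sigma^{-1}$ instead, the weight depends on the parameters and the gradient above no longer applies.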


14.3.5. Relation to Principal Component Analysis 

What is the relation of maximum likelihood to the principal component
analysis proposed by Hotelling (1933)? As explained in Chapter 11, the vector
of sample principal components is the orthogonal transformation $B'X$, where
the columns of $B$ are the characteristic vectors of $C$ normalized by $B'B = I$.
Then

(38)  $C = BTB' = \sum_{i=1}^{p} t_i b_i b_i',$

where $T$ is the diagonal matrix with diagonal elements $t_1, \ldots, t_p$, the characteristic
roots of $C$. If $t_{m+1}, \ldots, t_p$ are small, $C$ can be approximated by

(39)  $\sum_{i=1}^{m} t_i b_i b_i' = B_1 T_1 B_1',$

where $T_1$ is the diagonal matrix with diagonal elements $t_1, \ldots, t_m$, and $X$ is
approximated by

(40)  $B_1 B_1' X = \sum_{i=1}^{m} b_i (b_i' X).$

Then the sample covariance of the difference between $X$ and the approximation
(40) is the sample covariance of

(41)  $X - B_1 B_1' X = B_2 B_2' X,$

which is $B_2 T_2 B_2' = \sum_{i=m+1}^{p} t_i b_i b_i'$, and the sum of the variances of the
components is $\sum_{i=m+1}^{p} t_i$. Here $T_2$ is the diagonal matrix with $t_{m+1}, \ldots, t_p$ as
diagonal elements.

This analysis is in terms of some common unit of measurement. The first 
m components “explain” a large proportion of the “variance,” tr C. When the 
units of measurement are not the same (e.g., when the units are arbitrary), 
it is customary to standardize each measurement to (sample) variance 1.
However, then the principal components do not have the interpretation in 
terms of variance. 

Another difference between principal component analysis and factor analysis
is that the former does not separate the error from the systematic part.
This fault is easily remedied, however. Thomson (1934) proposed the following
estimation procedure for the factor analysis model. A diagonal matrix $\Psi$
is subtracted from $C$, and the principal component analysis is carried out on
$C - \Psi$. However, $\Psi$ is determined so $C - \Psi$ is close to rank $m$. The
equations are

(42)  $(C - \Psi)\Lambda = \Lambda L,$

(43)  $\operatorname{diag}(\Psi + \Lambda\Lambda') = \operatorname{diag} C,$

(44)  $\Lambda'\Lambda = L$ diagonal.

The last equation is a normalization and takes out the indeterminacy in $\Lambda$.
This method allows for the error terms, but still depends on the units of
measurement. The estimators are consistent but not (asymptotically) efficient
in the usual factor analysis model.
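
One simple way to attack these three equations is to alternate between them: extract the $m$ leading characteristic vectors of $C - \Psi$, then reset $\Psi$ so the diagonal of $\Psi + \Lambda\Lambda'$ matches that of $C$. The sketch below, assuming NumPy, does exactly that. It is an illustrative fixed-point iteration, not necessarily the computation Thomson used; the function name `principal_factor`, the zero starting value, and the stopping rule are assumptions, and the iteration is not guaranteed to converge or to keep every error variance positive.

```python
import numpy as np

def principal_factor(C, m, psi0=None, n_iter=500, tol=1e-10):
    """Alternate between an eigen-decomposition of C - Psi and a diagonal update."""
    psi = np.zeros(C.shape[0]) if psi0 is None else np.array(psi0, float)
    for _ in range(n_iter):
        w, v = np.linalg.eigh(C - np.diag(psi))      # ascending roots
        ell = np.clip(w[::-1][:m], 0.0, None)        # m largest roots, floored at 0
        Lam = v[:, ::-1][:, :m] * np.sqrt(ell)       # columns scaled so Lam'Lam = L
        psi_new = np.diag(C) - np.sum(Lam ** 2, axis=1)
        converged = np.abs(psi_new - psi).max() < tol
        psi = psi_new
        if converged:
            break
    return Lam, psi
```

Starting from `psi0 = None` the first pass is an ordinary principal component analysis of $C$.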


14.3.6. The Centroid Method 

Before the availability of high-speed computers, the centroid method was 
used almost exclusively because of its computational ease. For the sake of 
history we give a sketch of the method. Let $R^*$ be the correlation reduced
matrix, that is, the matrix consisting of $r_{ij}$, $i \ne j$, and $1 - \psi_i^*$, where $\psi_i^*$ is an
initial estimate of the error variance in standard deviation units. Thomson's
principal components approach is first to find the $m$ characteristic vectors of
$R_0 = R^*$ corresponding to the $m$ largest characteristic roots. As indicated in
Chapter 11, one computational method involves starting with an initial
estimate of the first vector, say $x^{(0)}$, calculating $x^{(1)} = R_0 x^{(0)}$, and iterating. At
the $r$th step $x^{(r)}$ is approximately $\gamma_1 x^{(r-1)}$, where $\gamma_1$ is the largest root and
$x^{(r)\prime}x^{(r)} \approx \gamma_1 x^{(r)\prime}x^{(r-1)}$. Then $y_1 = x^{(r)}/\sqrt{x^{(r)\prime}x^{(r-1)}}$ is approximately
the first characteristic vector normalized so $y_1'y_1 = \gamma_1$. To obtain the second
vector, apply the same procedure to $R_1 = R^* - y_1 y_1'$.

The centroid method can be considered as a very rough approximation to 
the principal component approach. With psychological tests the correlation 
matrix usually consists of positive entries, and the first characteristic vector 
has all positive components, often of about the same value. The centroid
method uses $e = (1, \ldots, 1)'$ as the initial estimate of the first vector. Then
$R^* e = x^{(1)}$ is the first iterate and should be an approximation to the first
characteristic vector. An approximation to the first characteristic root is
$e'R^*e / e'e$. Then $y_1 = x^{(1)}/\sqrt{e'R^*e}$ is an approximation to the first characteristic
vector of $R^*$ normalized to have length squared $\gamma_1$. The operations
can be carried out on an adding machine or on a desk calculator because
$R^* e$ amounts to adding across rows and $e'R^*e$ is the sum of those row
totals. 

The second characteristic vector is orthogonal to the first. A vector
orthogonal to $e$ is $e^*$ consisting of $p/2$ 1's and $p/2$ $-1$'s. Then $R_1 e^* = x_2$ is
an approximation to the second characteristic vector, and $e^{*\prime}R_1e^*/e^{*\prime}e^*$
approximates the second characteristic root. These operations involve changing
signs of entries of $R_1$ and adding. The positions of the $-1$'s in $e^*$ are
selected to maximize $e^{*\prime}R_1e^*$. The procedure can be continued.
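
The centroid steps can be mimicked in a few lines. The sketch below, assuming NumPy, uses a vector of ones for the first factor and, for later factors, searches over sign vectors with equal numbers of $+1$'s and $-1$'s. The exhaustive search via `itertools.combinations` is a modern convenience adopted here for clarity (it assumes a small, even $p$ and a positive quadratic form at each step); it is not the hand procedure used historically. The function name `centroid_factors` is hypothetical.

```python
import numpy as np
from itertools import combinations

def centroid_factors(R_star, m):
    """Approximate loadings and roots by the centroid method (illustrative)."""
    R = np.array(R_star, float)
    p = R.shape[0]
    loadings, roots = [], []
    for k in range(m):
        if k == 0:
            e = np.ones(p)                       # first factor: sum across rows
        else:
            best = -np.inf                       # later factors: balanced sign vector
            for plus in combinations(range(p), p // 2):
                cand = -np.ones(p)
                cand[list(plus)] = 1.0
                val = cand @ R @ cand
                if val > best:
                    best, e = val, cand
        x = R @ e                                # row totals after sign changes
        total = e @ x                            # sum of those totals
        y = x / np.sqrt(total)                   # approximate loading vector
        loadings.append(y)
        roots.append(total / p)                  # approximate characteristic root
        R = R - np.outer(y, y)                   # residual matrix for the next factor
    return np.column_stack(loadings), np.array(roots)
```

The sign of each column after the first is arbitrary, since a sign vector and its negative give the same value of the quadratic form.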


14.4. ESTIMATION FOR FIXED FACTORS 


Let $x_\alpha = (x_{1\alpha}, \ldots, x_{p\alpha})'$ be an observation on $X_\alpha$ given by

(1)  $X_\alpha = \Lambda f_\alpha + \mu + U_\alpha$

with $f_\alpha$ being a nonstochastic vector (an incidental parameter), $\alpha = 1, \ldots, N$,
satisfying $\sum_{\alpha=1}^{N} f_\alpha = 0$. The likelihood function is

(2)  $L = \frac{1}{(2\pi)^{pN/2} \prod_{i=1}^{p} \psi_{ii}^{N/2}} \exp\left[-\frac12 \sum_{i=1}^{p} \sum_{\alpha=1}^{N} \frac{\left(x_{i\alpha} - \mu_i - \sum_{j=1}^{m} \lambda_{ij} f_{j\alpha}\right)^2}{\psi_{ii}}\right].$

This likelihood function does not have a maximum. To show this fact, let
$\mu_1 = 0$, $\lambda_{11} = 1$, $\lambda_{1j} = 0$ $(j \ne 1)$, $f_{1\alpha} = x_{1\alpha}$. Then $x_{1\alpha} - \mu_1 - \sum_{j=1}^{m} \lambda_{1j} f_{j\alpha} = 0$,
and $\psi_{11}$ does not appear in the exponent but appears only in the constant.
As $\psi_{11} \to 0$, $L \to \infty$. Thus the likelihood does not have a maximum, and the
maximum likelihood estimators do not exist [Anderson and Rubin (1956)].
Lawley (1941) set the partial derivatives of the likelihood equal to 0, but
Solari (1969) showed that the solution is only a stationary value, not a
maximum.
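
The unboundedness is easy to see numerically. The sketch below, assuming NumPy, evaluates the logarithm of the fixed-factor likelihood for a diagonal error covariance and, in the usage that follows, applies the construction described above (first loading row equal to a unit vector, first factor score equal to the first centered measurement). The function name `fixed_factor_loglik` and the array layout (variables in rows, observations in columns) are assumptions made for this illustration.

```python
import numpy as np

def fixed_factor_loglik(X, Lam, F, mu, psi):
    """Log likelihood for nonstochastic factor scores.
    X is p x N, Lam is p x m, F is m x N, mu and psi are length-p vectors."""
    p, N = X.shape
    resid = X - mu[:, None] - Lam @ F
    return (-0.5 * p * N * np.log(2 * np.pi)
            - 0.5 * N * np.sum(np.log(psi))
            - 0.5 * np.sum(resid ** 2 / psi[:, None]))
```

For example, with centered data `X`, `Lam = [[1], [0], [0]]`, and `F = X[:1]`, the first row of residuals vanishes, so shrinking `psi[0]` toward 0 increases the log likelihood without bound: each factor of 10 reduction adds $\tfrac12 N \log 10$.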

Since maximum likelihood estimators do not exist in the case of fixed
factors, what estimation methods can be used? One possibility is to use the
maximum likelihood method appropriate for random factors. It was stated by
Anderson and Rubin (1956) and proved by Fuller, Pantula, and Amemiya
(1982) in the case of identification by 0's that the asymptotic normal
distribution of the maximum likelihood estimators for the random case is the same as
for fixed factors.

The sample covariance matrix under normality has the noncentral Wishart
distribution [Anderson (1946a)] depending on $\Psi$, $\Lambda\Phi\Lambda'$, and $N - 1$. Anderson
and Rubin (1956) proposed maximizing this likelihood function. However,
one of the equations is difficult to solve. Again the estimators are
asymptotically equivalent to the maximum likelihood estimators for the
random-factor case.


14.5. FACTOR INTERPRETATION AND TRANSFORMATION 
14.5.1. Interpretation

The identification restrictions of $\Lambda'\Psi^{-1}\Lambda$ diagonal or the first $m$ rows of $\Lambda$
being $I_m$ may be convenient for computing the maximum likelihood estimators,
but the components of the factor score vector may not have any intrinsic
meaning. We saw in Section 14.2 that 0 coefficients may give meaning to a
factor by the fact that this factor does not affect certain tests. Similarly, large
factor loadings may help in interpreting a factor. The coefficient of verbal
ability, for example, should be large on tests that look like they are verbal.

In psychology each variable or factor usually has a natural positive direction:
more answers right on a test and more of the ability represented by the
factor. It is usually expected that more ability leads to higher performance;
that is, the factor loading should be positive if it is not 0. Therefore, roughly
speaking, for the sake of interpretation, one may look for factor loadings that
are either 0 or positive and large.

Figure 14.1. Rows of $\Lambda$.

14.5.2. Transformations

The maximum likelihood estimators on the basis of some arbitrary identification
conditions including $\hat\Phi = I$ are $\hat\Lambda$ and $\hat\Psi$. We consider transformations

(1)  $\hat\Lambda^* = \hat\Lambda P, \qquad \hat\Phi^* = P^{-1}(P^{-1})' = (P'P)^{-1}.$

If the factors are to be orthogonal, then $\hat\Phi^* = I$ and $P$ is orthogonal. If the
factors are permitted to be oblique, $P$ can be an arbitrary nonsingular matrix
and $\hat\Phi^*$ an arbitrary positive definite matrix.

The rows of $\hat\Lambda$ can be plotted in an $m$-dimensional space. Figure 14.1 is a
plot of the rows of a $5 \times 2$ matrix $\hat\Lambda$. The coordinates refer to factors and the
points refer to tests. If $\hat\Phi^*$ is required to be $I_m$, we are seeking a rotation of
coordinate axes in this space. In the example that is graphed, a rotation of
45° would put all of the points into the positive quadrant, that is, $\hat\lambda_{ij}^* \ge 0$. One
of the new coordinates would be large for each of the first three points and
small for the other two points, and the other coordinate would be small for
the first three and large for the last two. The first factor is representative of
what is common to the first three tests, and the second factor of what is
common to the last two tests.

If $m > 2$, a general rotation can be approximated manually by a sequence
of two-dimensional rotations.

If $\hat\Phi^*$ is not required to be $I_m$, the transformation $P$ is simply nonsingular.
If the normalization of the $j$th column of $\Lambda$ is $\lambda^*_{i(j),j} = 1$, then

(2)  $1 = \lambda^*_{i(j),j} = \sum_{k=1}^{m} \lambda_{i(j),k}\, p_{kj};$

each column of $P$ satisfies such a constraint. If the normalization is $\phi^*_{jj} = 1$,
then

(3)  $\sum_{k=1}^{m} (p^{jk})^2 = 1,$

where $(p^{jk}) = P^{-1}$.

Of the various computational procedures that are based on optimizing an
objective function, we describe the varimax method proposed by Kaiser
(1958) to be carried out on pairs of factors. Horst (1965), Chapter 18,
extended the method to be done on all factors simultaneously. A modified
criterion is

(4)  $\sum_{j=1}^{m} \sum_{i=1}^{p} \left(\lambda_{ij}^{*2} - \frac{1}{p}\sum_{h=1}^{p} \lambda_{hj}^{*2}\right)^2 = \sum_{j=1}^{m} \left[\sum_{i=1}^{p} \lambda_{ij}^{*4} - \frac{1}{p}\left(\sum_{i=1}^{p} \lambda_{ij}^{*2}\right)^2\right],$

which is proportional to the sum of the column variances of the squares of
the transformed factor loadings. The orthogonal matrix $P$ is selected so as to
maximize (4). The procedure tends to maximize the scatter of $\lambda_{ij}^{*2}$ within
columns. Since $\lambda_{ij}^{*2} \ge 0$, there is a tendency to obtain some large loadings and
some near 0. Kaiser's original criterion was (4) with $\lambda_{ij}^{*2}$ replaced by
$\lambda_{ij}^{*2} / \sum_{h=1}^{m} \lambda_{ih}^{*2}$.

Lawley and Maxwell (1971) describe other criteria. One of them is a
measure of similarity to a predetermined $p \times m$ matrix of 1's and 0's.

14.5.3. Orthogonal versus Oblique Factors

In the case of orthogonal factors the components are uncorrelated in the 
population or in the sample according to whether the factors are considered 
random or fixed. The idea of uncorrelated factor scores has appeal. Some
psychologists claim that the orthogonality of the factor scores is essential if
one is to consider the factor scores more basic than the test scores.
Considerable debate has gone on among psychologists concerning this point. On the
other side, Thurstone (1947), page vii, says “it seems just as unnecessary to
require that mental traits shall be uncorrelated in the general population as
to require that height and weight be uncorrelated in the general population.”

As we have seen, given a pair of matrices $\Lambda$, $\Phi$, equivalent pairs are given
by $\Lambda P$, $P^{-1}\Phi P'^{-1}$ for nonsingular $P$'s. The pair may be selected (i.e., the $P$
given $\Lambda$, $\Phi$) as the one with the most meaningful interpretation in terms of
the subject matter of the tests. The idea of simple structure is that with 0
factor loadings in certain patterns the component factor scores can be given
meaning regardless of the moment matrix. Permitting $\Phi$ to be an arbitrary
positive definite matrix allows more 0's in $\Lambda$.

Another consideration in selecting transformations or identification condi¬ 
tions is autonomy, or permanence, or invariance with regard to certain 
changes. For example, what happens if a selection of the constituents of a 
population is made? In case of intelligence tests, suppose a selection is made, 
such as college admittees out of high school seniors, that can be assumed to 
involve the primary abilities. One can envisage that the relation between 
unobserved factor scores $f$ and observed test scores $x$ is unaffected by the
selection, that is, that the matrix of factor loadings $\Lambda$ is unchanged. The
variance of the errors (and specific factors), the diagonal elements of $\Psi$, may
also be considered as unchanged by the selection because the errors are
uncorrelated with the factors (primary abilities).

Suppose there is a true model, $\Lambda$, $\Phi$, $\Psi$, and the investigator applies
identification conditions that permit him to discover it. Next, suppose there is
a selection that results in a new population of factor scores so that their
covariance matrix is $\Phi^*$. When the investigator analyzes the new observed
covariance matrix $\Psi + \Lambda\Phi^*\Lambda'$, will he find $\Lambda$ again? If part of the
identification conditions are that the factor moment matrix is $I$, then he will obtain
a different factor loading matrix. On the other hand, if the identification
conditions are entirely on the factor loadings (specified 0's and 1's), the factor
loading matrix from the analysis is the same as before.

The same consideration is relevant in comparing two populations. It may
be reasonable to consider that $\Psi_1 = \Psi_2$, $\Lambda_1 = \Lambda_2$, but $\Phi_1 \ne \Phi_2$. To test the
hypothesis that $\Phi_1 = \Phi_2$, one wants to use identification conditions that
agree with $\Lambda_1 = \Lambda_2$ (rather than $\Lambda_1 = \Lambda_2 C$). The condition should be on
the factor loadings.

What happens if more tests are added (or deleted)? In addition to
observing $X = \Lambda f + \mu + U$, suppose one observes $X^* = \Lambda^* f + \mu^* + U^*$,
where $U^*$ is uncorrelated with $U$. Since the common factors $f$ are unchanged,
$\Phi$ is unchanged. However, the (arbitrary) condition that $\Lambda'\Psi^{-1}\Lambda$
is diagonal is changed; use of this type of condition would lead to a rotation
of the factor loadings.


14.6. ESTIMATION FOR IDENTIFICATION BY SPECIFIED ZEROS 

We now consider estimation of $\Lambda$, $\Psi$, and $\Phi$ when $\Phi$ is unrestricted and $\Lambda$
is identified by specified 0's and 1's. We assume that each column of $\Lambda$ has at
least $m - 1$ 0's in specified positions and that the submatrix consisting of the
rows of $\Lambda$ containing the 0's specified for a given column is of rank $m - 1$.
(See Section 14.2.2.) We further assume that each column of $\Lambda$ has 1 in a
specified position or, alternatively, that the diagonal element of $\Phi$
corresponding to that column is 1. Then the model is identified.

The likelihood function is given by (1) of Section 14.3. The derivatives of
the likelihood function set equal to 0 are

(1)  $\operatorname{diag} \Sigma^{-1}[C - (\Psi + \Lambda\Phi\Lambda')]\Sigma^{-1} = \operatorname{diag} 0,$

(2)  $\Lambda'\Sigma^{-1}[C - (\Psi + \Lambda\Phi\Lambda')]\Sigma^{-1}\Lambda = 0$

for positions in $\Phi$ that are not specified, and

(3)  $\Sigma^{-1}[C - (\Psi + \Lambda\Phi\Lambda')]\Sigma^{-1}\Lambda\Phi = 0$

for positions in $\Lambda$ not specified, where

(4)  $\Sigma = \Psi + \Lambda\Phi\Lambda'.$

These equations cannot be simplified as in Section 14.3.1 because (3) holds
only for unspecified positions in $\Lambda$, and hence one cannot multiply by $\Sigma$ on
the left. [See Howe (1955), Anderson and Rubin (1956), and Lawley (1958).]
These equations are not useful for computation. The likelihood function,
however, can be maximized numerically.

As noted before, a change in units of measurement, $X^* = DX$, results in a
corresponding change in the parameters $\Lambda$ and $\Psi$ if identification is by 0 in
specified positions of $\Lambda$ and normalization is by $\phi_{jj} = 1$, $j = 1, \ldots, m$. It is
readily verified that the derivative equations (1), (2), (3), and (4) are changed
in a corresponding manner.

Anderson and Amemiya (1988a) have derived the asymptotic distribution 
of the estimators under general conditions. Normality of the observations is 
not required. See also Anderson and Amemiya (1988b). 


14.7. ESTIMATION OF FACTOR SCORES

It is frequently of interest to estimate the factor scores of the individuals in
the group being studied. In the model with nonstochastic factors the factor
scores are incidental parameters that characterize the individuals. As we
have seen (Section 14.4), the maximum likelihood estimators of the parameters
$(\Psi, \Lambda, \mu, f_1, \ldots, f_N)$ do not exist. We shall therefore study the estimation
of the factor scores on the basis that the structural parameters $(\Psi, \Lambda, \mu)$
are known.

When $f_\alpha$ is considered as an incidental parameter, $x_\alpha - \mu$ is an observation
from a distribution with mean $\Lambda f_\alpha$ and covariance matrix $\Psi$. The
weighted least squares estimator of $f_\alpha$ is

(1)  $\hat f_\alpha = (\Lambda'\Psi^{-1}\Lambda)^{-1}\Lambda'\Psi^{-1}(x_\alpha - \mu) = \Gamma^{-1}\Lambda'\Psi^{-1}(x_\alpha - \mu),$

where $\Gamma = \Lambda'\Psi^{-1}\Lambda$ (not necessarily diagonal). This estimator is unbiased
and its covariance matrix is

(2)  $\mathcal{E}(\hat f_\alpha - f_\alpha)(\hat f_\alpha - f_\alpha)' = \Gamma^{-1}$

by the usual generalized least squares theory [Bartlett (1937b), (1938)]. It is
the minimum variance unbiased linear estimator of $f_\alpha$. If $x_\alpha$ is normal, the
estimator is also maximum likelihood.

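When $f_\alpha$ is considered random [Thomson (1951)], we suppose $X_\alpha$ and $f_\alpha$
have a joint normal distribution with mean vector $(\mu', 0')'$ and covariance
matrix

(3)  $\begin{pmatrix} \Psi + \Lambda\Phi\Lambda' & \Lambda\Phi \\ \Phi\Lambda' & \Phi \end{pmatrix}.$

Then the regression of $f$ on $X$ (Section 2.5) is

(4)  $\begin{aligned} \mathcal{E}(f \mid x) &= \Phi\Lambda'(\Psi + \Lambda\Phi\Lambda')^{-1}(x - \mu) \\ &= (\Phi^{-1} + \Gamma)^{-1}\Lambda'\Psi^{-1}(x - \mu). \end{aligned}$

The estimator or predictor of $f_\alpha$ is

(5)  $f_\alpha^* = (\Phi^{-1} + \Gamma)^{-1}\Lambda'\Psi^{-1}(x_\alpha - \mu).$

If $\Phi = I$, the predictor is

(6)  $f_\alpha^* = (I + \Gamma)^{-1}\Lambda'\Psi^{-1}(x_\alpha - \mu).$

When $\Gamma$ is also diagonal, the $j$th element of (6) is $\gamma_j/(1 + \gamma_j)$ times the $j$th
element of (1). In the conditional distribution of $x_\alpha$ given $f_\alpha$ (for $\Phi = I$)

(7)  $\mathcal{E}(f_\alpha^* \mid f_\alpha) = (I + \Gamma)^{-1}\Gamma f_\alpha,$

(8)  $\mathcal{C}(f_\alpha^* \mid f_\alpha) = (I + \Gamma)^{-1}\Gamma(I + \Gamma)^{-1},$

(10)  $\mathcal{E}(f_\alpha^* - f_\alpha)(f_\alpha^* - f_\alpha)' = (I + \Gamma)^{-1}.$

This last matrix, describing the mean squared error, is smaller than (2)
describing the unbiased estimator. The estimator (5) or (6) is a Bayes
estimator and is appropriate when $f_\alpha$ is treated as random.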
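
The two kinds of factor score estimates can be compared directly. The sketch below, assuming NumPy and known structural parameters, computes the weighted least squares scores $\Gamma^{-1}\Lambda'\Psi^{-1}(x - \mu)$ and the regression scores $(\Phi^{-1} + \Gamma)^{-1}\Lambda'\Psi^{-1}(x - \mu)$, where $\Gamma = \Lambda'\Psi^{-1}\Lambda$. The function names and the convention that rows of `X` are observations are assumptions made for this illustration.

```python
import numpy as np

def bartlett_scores(X, Lam, psi, mu):
    """Weighted least squares factor scores; rows of X are observations."""
    W = Lam / psi[:, None]                    # Psi^{-1} Lambda
    Gamma = Lam.T @ W                         # Lambda' Psi^{-1} Lambda
    return np.linalg.solve(Gamma, W.T @ (X - mu).T).T

def regression_scores(X, Lam, psi, mu, Phi=None):
    """Regression (Thomson) factor scores; Phi defaults to the identity."""
    m = Lam.shape[1]
    Phi = np.eye(m) if Phi is None else Phi
    W = Lam / psi[:, None]
    Gamma = Lam.T @ W
    return np.linalg.solve(np.linalg.inv(Phi) + Gamma, W.T @ (X - mu).T).T
```

When $\Phi = I$ and $\Gamma$ is diagonal, each column of the regression scores is the corresponding column of the weighted least squares scores shrunk by $\gamma_j/(1 + \gamma_j)$, which is the trade of bias for smaller mean squared error described above.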


PROBLEMS 


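14.1. (Sec. 14.2) Identification by 0's. Let

$\Lambda = \begin{pmatrix} 0 & \Lambda^{(1)} \\ \lambda^{(1)} & \Lambda^{(2)} \end{pmatrix}, \qquad C = \begin{pmatrix} c_{11} & c_{12}' \\ c_{21} & C_{22} \end{pmatrix},$

where $C$ is nonsingular. Show that

$\Lambda C = \begin{pmatrix} 0 & \Lambda^{*(1)} \\ \lambda^{*(1)} & \Lambda^{*(2)} \end{pmatrix}$

implies

$C = \begin{pmatrix} c_{11} & c_{12}' \\ 0 & C_{22} \end{pmatrix}$

if and only if $\Lambda^{(1)}$ is of rank $m - 1$.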

14.2. (Sec, 14,3) For p = 3, m = L and A = A* prove | = n ;、，（ A?/ 山, ,)， 

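14.3. (Sec. 14.3) The EM algorithm.

(a) If $f$ and $U$ are normal and $f$ and $X$ are observed, show that the likelihood
function based on $(x_1, f_1), \ldots, (x_N, f_N)$ is

$\prod_{\alpha=1}^{N} \left\{ \frac{1}{(2\pi)^{p/2} \prod_{i=1}^{p} \psi_{ii}^{1/2}} \exp\left[-\frac12 \sum_{i=1}^{p} \frac{\left(x_{i\alpha} - \mu_i - \sum_{j=1}^{m} \lambda_{ij} f_{j\alpha}\right)^2}{\psi_{ii}}\right] \cdot \frac{1}{(2\pi)^{m/2} |\Phi|^{1/2}} \exp\left(-\frac12 f_\alpha'\Phi^{-1}f_\alpha\right) \right\}.$

(b) Show that when the factor scores are included as data the sufficient set of
statistics is $\bar x$, $\bar f$, $C_{xx} = C$,

$C_{xf} = \frac{1}{N} \sum_{\alpha=1}^{N} (x_\alpha - \bar x)(f_\alpha - \bar f)', \qquad C_{ff} = \frac{1}{N} \sum_{\alpha=1}^{N} (f_\alpha - \bar f)(f_\alpha - \bar f)'.$

(c) Show that the conditional expectations of the covariances in (b) given
$X = (x_1, \ldots, x_N)$, $\Lambda$, $\Phi$, and $\Psi$ are

$C_{xf}^* = \mathcal{E}(C_{xf} \mid X, \Lambda, \Phi, \Psi) = C_{xx}(\Psi + \Lambda\Phi\Lambda')^{-1}\Lambda\Phi,$

$C_{ff}^* = \mathcal{E}(C_{ff} \mid X, \Lambda, \Phi, \Psi) = \Phi\Lambda'(\Psi + \Lambda\Phi\Lambda')^{-1}C_{xx}(\Psi + \Lambda\Phi\Lambda')^{-1}\Lambda\Phi + \Phi - \Phi\Lambda'(\Psi + \Lambda\Phi\Lambda')^{-1}\Lambda\Phi.$

(d) Show that the maximum likelihood estimators of $\Lambda$ and $\Psi$ given $\Phi = I$ are

$\hat\Lambda = C_{xf}^* C_{ff}^{*-1},$

$\hat\Psi = C_{xx}^* - C_{xf}^* C_{ff}^{*-1} C_{fx}^*.$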

CHAPTER 15 


Patterns of Dependence; 
Graphical Models 


15.1. INTRODUCTION

An emphasis in multivariate statistical analysis is that several measurements
on a number of individuals or objects may be correlated, and the methods
developed in this book take account of that dependence. The amount of
association between two variables may be measured by the (Pearson) correlation
of them (a symmetric measure); the association between one variable
and a set may be quantified by a multiple correlation; and the dependence
between one set and another set may be studied by criteria of independence
such as studied in Chapter 9 or by canonical correlations. Similar measures
can be applied in conditional distributions. Another kind of dependence
(asymmetrical) is characterized by regression coefficients and related measures.
In this chapter we study models which involve several kinds of
dependence or more intricate patterns of dependence.

A graphical model in statistics is a visual diagram in which observable
variables are identified with points (vertices or nodes) connected by edges and
an associated family of probability distributions satisfying some independences
specified by the visual pattern. Edges may be undirected (drawn as
line segments) or directed (drawn as arrows). Undirected edges have to do
with symmetrical dependence and independence, while directed edges may
reflect a possible direction of action or sequence in time. These independences
may come from a priori knowledge of the subject matter or may
derive from these or other data. Advantages of the graphical display include
ease of comprehension, particularly of complicated patterns, ease of elicitation
of expert opinion, and ease of comparing probabilities.

Use of such diagrams goes back at least to the work of the geneticist
Sewall Wright (1921, 1934), who used the term “path analysis.” An elaborate
algebra has been developed for graphical models. Specification of
independences reduces the number of parameters to be determined. Some of
these independences are known as Markov properties. In a time series analysis
of a Markov process (of order 1), for example, the future of the process is
considered independent of the past when the present is given; in such a
model the correlation between a variable in the past and a variable in the
future is determined by the correlation between the present variable and the
variable of the immediate future. This idea is expanded in several ways.

The family of probability distributions associated with a given diagram
depends on the properties of the distribution that are represented by the
graph. These properties for diagrams consisting of undirected edges (known
as undirected graphs) will be described in Section 15.2; the properties for
diagrams consisting entirely of directed edges (known as directed graphs) in
Section 15.3; and properties of diagrams with both types of edges in Section
15.4. The methods of statistical inference will be given in Section 15.5.

In this chapter we assume that the variables have a joint nonsingular
normal distribution; hence, the characterization of a model is in terms of the
covariance matrix and its inverse, and functions of them. This assumption
implies that the variables are quantitative and have a positive density. The
mathematics of graphical models may apply to discrete variables (contingency
tables) and to nonnormal quantitative variables, but we shall not develop the
theory necessary to include them.

There is a considerable social science literature that has followed Wright’s 
original work. For recent reviews of this writing see, for example, Pearl 
(2000) and McDonald (2002). 


15.2. UNDIRECTED GRAPHS 

A graph is a set of vertices and edges, $G = (V, E)$. Each vertex is identified
with a random vector. In this chapter the random variables have a joint
normal distribution. Each undirected edge is a line connecting two vertices. It
is designated by its two end points; $(u, v)$ is the same as $(v, u)$ in an
undirected graph (but not in directed graphs).

Two vertices connected by an edge are called adjacent; if not connected by
an edge, they are called nonadjacent. In Figure 15.1(a) all vertices are
nonadjacent; in (b) a and b are adjacent; in (c) the pair a and b and the pair
a and c are adjacent; in (d) every pair of vertices is adjacent.

Figure 15.1

The family of (normal) distributions associated with G is defined by a set
of requirements on conditional distributions, known as Markov properties.
Since the distributions considered here are normal, the conditions have to do
with the covariance matrix $\Sigma$ and its inverse $\Lambda = \Sigma^{-1}$, which is known as the
concentration matrix. However, many of the lemmas and theorems hold for
nonnormal distributions. We shall consider three definitions of Markov and
then show that they are equivalent.

Definition 15.2.1. The probability distribution on a graph is pairwise
Markov with respect to G if for every pair of vertices (u, v) that are not adjacent
$X_u$ and $X_v$ are independent conditional on all the other variables in the graph.

In symbols

(1)    $X_u \perp\!\!\!\perp X_v \mid X_{V \setminus (u, v)},$

where $\perp\!\!\!\perp$ means independence and $V \setminus (u, v)$ indicates the set V with u and v
deleted. The definition of pairwise Markov is that $\rho_{uv \cdot V \setminus (u, v)} = 0$ for all pairs
for which $(u, v) \notin E$. We may also write $u \perp\!\!\!\perp v \mid V \setminus (u, v)$.

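Let $\Sigma$ and $\Lambda = \Sigma^{-1}$ be partitioned as

(2)    $\Sigma = \begin{pmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{pmatrix}, \qquad \Lambda = \begin{pmatrix} \Lambda_{AA} & \Lambda_{AB} \\ \Lambda_{BA} & \Lambda_{BB} \end{pmatrix},$

where A and B are disjoint sets of vertices. The conditional distribution of
$X_A$ given $X_B$ is

(3)    $N[\mu_A + \Sigma_{AB}\Sigma_{BB}^{-1}(x_B - \mu_B),\ \Sigma_{AA \cdot B}].$

The conditional covariance matrix is

(4)    $\Sigma_{AA \cdot B} = \Sigma_{AA} - \Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA} = \Lambda_{AA}^{-1}.$

If A = (1, 2) and B = (3, ..., p), the covariance of $X_1$ and $X_2$ given $X_3, \ldots, X_p$
is $\sigma_{12 \cdot 3 \cdots p}$ in $\Sigma_{AA \cdot B} = (\sigma_{ij \cdot 3 \cdots p})$. This is 0 if and only if $\lambda_{12} = 0$; that is, $\Sigma_{AA \cdot B}$
is diagonal if and only if $\Lambda_{AA}$ is diagonal.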
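
The identity in (4) can be checked numerically. The following is a minimal sketch, assuming a hypothetical 3 x 3 concentration matrix with $\lambda_{12} = 0$; the helper `inverse` and the numbers are illustrative and not part of the text. Exact fractions are used so that the comparison is free of rounding.

```python
from fractions import Fraction


def inverse(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with exact fractions."""
    n = len(matrix)
    # Augment each row with the corresponding row of the identity matrix.
    rows = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
            for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        scale = rows[col][col]
        rows[col] = [x / scale for x in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    return [row[n:] for row in rows]


# Hypothetical concentration matrix with lambda_12 = 0.
concentration = [[2, 0, 1],
                 [0, 2, 1],
                 [1, 1, 3]]
sigma = inverse(concentration)

# Conditional covariance of (X1, X2) given X3:
# Sigma_AA - Sigma_AB Sigma_BB^{-1} Sigma_BA with A = (1, 2), B = (3).
cond_cov = [[sigma[i][j] - sigma[i][2] * sigma[2][j] / sigma[2][2]
             for j in range(2)] for i in range(2)]

# Inverse of the block Lambda_AA of the concentration matrix.
block_inverse = inverse([row[:2] for row in concentration[:2]])

print(cond_cov == block_inverse)   # checks the identity in (4)
print(cond_cov[0][1])              # partial covariance of X1 and X2
```

Because $\lambda_{12} = 0$ in this example, the off-diagonal element of the conditional covariance matrix vanishes, although $X_1$ and $X_2$ are marginally correlated.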

Theorem 15.2.1. If a distribution on a graph is pairwise Markov, $\lambda_{ij} = 0$ for
$(i, j) \notin E$.

Definition 15.2.2. The boundary of a set A, termed bd(A), consists of
those vertices not in A that are adjacent to A. The closure of A, termed cl(A), is
$A \cup \mathrm{bd}(A)$.

Definition 15.2.3. A distribution on a graph is locally Markov if for every
vertex v the variable $X_v$ is independent of the variables not in cl(v) conditional on
the boundary of v; in notation,

(5)    $X_v \perp\!\!\!\perp X_{V \setminus \mathrm{cl}(v)} \mid X_{\mathrm{bd}(v)}.$

Theorem 15.2.2. The conditional independences

(6)    $X \perp\!\!\!\perp Y \mid Z, \qquad X \perp\!\!\!\perp Z \mid Y$

hold if and only if

(7)    $X \perp\!\!\!\perp (Y, Z).$

Proof. The relations (6) imply that the density of X, Y, and Z can be
written as

(8)    $f(x, y, z) = f(x \mid z)\, g(y \mid z)\, h(z) = k(x \mid y)\, l(z \mid y)\, m(y).$

Since $g(y \mid z)h(z) = n(y, z) = l(z \mid y)m(y)$, (8) implies $f(x \mid z) = k(x \mid y)$, which in
turn implies $f(x \mid z) = k(x \mid y) = p(x)$. Hence

(9)    $f(x, y, z) = p(x)\, n(y, z),$

which is the density generating (7). Conversely, (9) can be written as either
form in (8), implying (6). ■

Corollary 15.2.1. The relations

(10)    $X \perp\!\!\!\perp Y \mid Z, W, \qquad X \perp\!\!\!\perp Z \mid Y, W$

hold if and only if

(11)    $X \perp\!\!\!\perp (Y, Z) \mid W.$

The relations in Theorem 15.2.2 and Corollary 15.2.1 are sometimes called
the block independence theorem. They are based on positive densities, that is,
nonsingular normal distributions.

Theorem 15.2.3. A locally Markov distribution on a graph is pairwise
Markov.

Proof. Suppose the graph is locally Markov (Definition 15.2.3). Let u and
v be nonadjacent vertices. Because v is not adjacent to u, it is not in bd(u);
hence,

(12)    $X_u \perp\!\!\!\perp X_{V \setminus \mathrm{cl}(u)} \mid X_{\mathrm{bd}(u)}.$

The relation (12) can be written

(13)    $X_u \perp\!\!\!\perp \{X_v, X_{V \setminus [v \cup \mathrm{cl}(u)]}\} \mid X_{\mathrm{bd}(u)}.$

Then Corollary 15.2.1 ($X = X_u$, $Y = X_v$, $Z = X_{V \setminus [\mathrm{cl}(u) \cup v]}$, $W = X_{\mathrm{bd}(u)}$) implies

(14)    $X_u \perp\!\!\!\perp X_v \mid X_{V \setminus (u, v)}.$ ■

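Theorem 15.2.4. A pairwise Markov distribution on a graph is locally
Markov.

Proof. Let $V \setminus \mathrm{cl}(u) = v_1 \cup \cdots \cup v_n$. Then

(15)    $u \perp\!\!\!\perp v_1 \mid \mathrm{bd}(u) \cup v_2 \cup \cdots \cup v_n, \qquad u \perp\!\!\!\perp v_2 \mid \mathrm{bd}(u) \cup v_1 \cup v_3 \cup \cdots \cup v_n,$

which by Corollary 15.2.1 implies

(16)    $u \perp\!\!\!\perp v_1 \cup v_2 \mid \mathrm{bd}(u) \cup v_3 \cup \cdots \cup v_n.$

Further, (16) and

(17)    $u \perp\!\!\!\perp v_3 \mid \mathrm{bd}(u) \cup v_1 \cup v_2 \cup v_4 \cup \cdots \cup v_n$

imply

(18)    $u \perp\!\!\!\perp v_1 \cup v_2 \cup v_3 \mid \mathrm{bd}(u) \cup v_4 \cup \cdots \cup v_n.$

This procedure leads to

(19)    $u \perp\!\!\!\perp v_1 \cup \cdots \cup v_n \mid \mathrm{bd}(u).$ ■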


A third notion of Markov, namely, global, requires some definitions. 





Definition 15.2.4. A path from B to C is a sequence $v_0, v_1, v_2, \ldots, v_n$ of
adjacent vertices with $v_0 \in B$ and $v_n \in C$.

Definition 15.2.5. A set S separates sets B and C if S, B, and C are
disjoint and every path from B to C intersects S.

Thus S separates B and C if for every sequence of adjacent vertices $v_0, v_1, \ldots, v_n$
with $v_0 \in B$ and $v_n \in C$ at least one of $v_1, \ldots, v_{n-1}$ is a vertex in S. Here B
and/or C are nonempty, but S can be empty.

Definition 15.2.6. A distribution on a graph is globally Markov if for every
triplet of disjoint sets S, B, and C such that S separates B and C the vector
variables $X_B$ and $X_C$ are independent conditional on $X_S$.

In the example of Figure 15.1(c), a separates b and c. If $\rho_{bc \cdot a} = 0$, that is,
$\rho_{bc} - \rho_{ba}\rho_{ac} = 0$, the distribution is globally Markov. Note that a set of
vertices is identified with a vector of variables.
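
Separation in the sense of Definition 15.2.5 can be checked mechanically by searching for a path from B to C that avoids S. The following is a minimal sketch, assuming the graph is supplied as a list of undirected edges; the function name `separates` and the vertex labels are illustrative only.

```python
from collections import deque


def separates(edges, s, b, c):
    """Return True if every path from b to c in the undirected graph meets s."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    blocked, targets = set(s), set(c)
    # Breadth-first search from b that is never allowed to enter s.
    queue = deque(v for v in b if v not in blocked)
    seen = set(queue)
    while queue:
        u = queue.popleft()
        if u in targets:
            return False          # reached c without passing through s
        for w in neighbors.get(u, ()):
            if w not in seen and w not in blocked:
                seen.add(w)
                queue.append(w)
    return True


# Figure 15.1(c): edges a-b and a-c.
edges_c = [("a", "b"), ("a", "c")]
print(separates(edges_c, {"a"}, {"b"}, {"c"}))   # a separates b and c
print(separates(edges_c, set(), {"b"}, {"c"}))   # the empty set does not
```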

The global Markov property puts restrictions on the possible (normal)
distributions, and that implies fewer parameters about which to make inferences.

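Suppose $V = A \cup B \cup S$, where A, B, and S are disjoint. Partition $\Sigma$ and
$\Lambda = \Sigma^{-1}$, the concentration matrix, as

(20)    $\Sigma = \begin{pmatrix} \Sigma_{AA} & \Sigma_{AB} & \Sigma_{AS} \\ \Sigma_{BA} & \Sigma_{BB} & \Sigma_{BS} \\ \Sigma_{SA} & \Sigma_{SB} & \Sigma_{SS} \end{pmatrix}, \qquad \Lambda = \begin{pmatrix} \Lambda_{AA} & \Lambda_{AB} & \Lambda_{AS} \\ \Lambda_{BA} & \Lambda_{BB} & \Lambda_{BS} \\ \Lambda_{SA} & \Lambda_{SB} & \Lambda_{SS} \end{pmatrix}.$

The conditional distribution of $(X_A', X_B')'$ given $X_S$ is normal with covariance
matrix

(21)    $\begin{pmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{pmatrix} - \begin{pmatrix} \Sigma_{AS} \\ \Sigma_{BS} \end{pmatrix} \Sigma_{SS}^{-1} \begin{pmatrix} \Sigma_{SA} & \Sigma_{SB} \end{pmatrix} = \begin{pmatrix} \Lambda_{AA} & \Lambda_{AB} \\ \Lambda_{BA} & \Lambda_{BB} \end{pmatrix}^{-1}.$

Theorem 15.2.5. If S separates A and B in a graph with a globally Markov
distribution, $\Lambda_{AB} = 0$.

Proof. Because S separates A and B, every element u of A and every
element v of B are nonadjacent, for otherwise the path (u, v) would connect
A and B without intersecting S. The globally Markov property is that $X_A$
and $X_B$ are uncorrelated in the conditional distribution, implying that the
conditional covariance matrix of $(X_A', X_B')'$ given $X_S$ is block diagonal and
hence that $\Lambda_{AB} = 0$. ■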

Theorem 15.2.6. A distribution on a globally Markov graph is pairwise
Markov.

Proof. Let the set B be i, the set C be j not adjacent to i, and the set A
the rest of the variables. Any path from B to C must include elements of A.
Hence i is independent of j in the distribution conditioned on the other
variables. ■

Theorem 15.2.7. A globally Markov family of distributions on a graph is
locally Markov.

Proof. The boundary of a set B separates B and $V \setminus \mathrm{cl}(B)$. ■

Theorem 15.2.8. A pairwise Markov family of distributions on a graph is
globally Markov.

Proof. Let A, B, and S be disjoint sets in a pairwise Markov graph such
that S separates A and B. Let #(S) and #(V) denote the numbers of
vertices in S and V, respectively. If #(V) = #(S) + 2, that is, $V = A \cup B \cup S$,
then there must be one vertex in each of A and B, and the pairwise Markov
property is exactly the globally Markov property. The rest of the proof is a
backward induction on #(S). Suppose #(V) - #(S) > 2 and $V = A \cup B \cup S$.
Then either A or B or both have more than one vertex. Suppose A has more
than one vertex, and let $u \in A$. Then $S \cup u$ separates $A \setminus u$ and B, and
$S \cup (A \setminus u)$ separates u and B. By the induction hypothesis

(22)    $X_{A \setminus u} \perp\!\!\!\perp X_B \mid (X_S, X_u), \qquad X_u \perp\!\!\!\perp X_B \mid (X_S, X_{A \setminus u}).$

By Corollary 15.2.1

(23)    $X_A \perp\!\!\!\perp X_B \mid X_S.$

Now suppose $A \cup B \cup S \subset V$. Let $u \in V \setminus (A \cup B \cup S)$. Then $S \cup u$ separates
A and B. By the induction hypothesis

(24)    $X_A \perp\!\!\!\perp X_B \mid (X_S, X_u).$

Also, either $A \cup S$ separates u and B or $B \cup S$ separates A and u.
(Otherwise there would be a path from B to u and from u to A that would
not intersect S.) If $A \cup S$ separates u and B,

(25)    $X_u \perp\!\!\!\perp X_B \mid (X_S, X_A).$

Then Corollary 15.2.1 applied to (24) and (25) implies

(26)    $(X_A, X_u) \perp\!\!\!\perp X_B \mid X_S,$

from which we derive $X_A \perp\!\!\!\perp X_B \mid X_S$. ■

Theorems 15.2.3, 15.2.4, 15.2.6, 15.2.7, and 15.2.8 show that the three Markov
properties are equivalent: any one implies the other two. The proofs here hold fairly
generally, but in this chapter a nonsingular multivariate normal distribution is
assumed; thus all densities are positive.

Definition 15.2.7. A graph $G = (V, E)$ is complete if and only if every two
vertices in V are adjacent.

The definition implies that the graph specifies no restriction on the
covariance matrix of the multivariate normal distribution.

A subset $A \subset V$ induces a subgraph $G_A = (A, E_A)$, where the edge set $E_A$
includes all edges (u, v) of G with $(u, v) \in E$, where u and $v \in A$.
A subset of a graph is complete if and only if every two vertices in A are
adjacent in $E_A$.

Definition 15.2.8. A clique is a maximal complete set of vertices.

"Maximal" means that if another vertex from V is added to the set, the set
will no longer be complete. A clique can be constructed by starting with one
vertex, say $v_1$. If it is not adjacent to any other vertex, $v_1$ alone constitutes a
clique. If $v_1$ is adjacent to $v_2$ [$(v_1, v_2) \in E$], continue constructing a clique
with $v_1$ and $v_2$ in it until a maximal complete subset is obtained. Thus every
vertex is a member of at least one clique, and every edge is included in at
least one clique.
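
The construction just described can be sketched as a greedy procedure: start from one vertex and add any vertex that is adjacent to every vertex already chosen. This is a minimal illustration, not an algorithm for listing all cliques; the function name `grow_clique` and the order in which vertices are tried are assumptions of the sketch.

```python
def grow_clique(vertices, edges, start):
    """Greedily extend {start} to a maximal complete set (a clique)."""
    adjacent = {v: set() for v in vertices}
    for u, v in edges:
        adjacent[u].add(v)
        adjacent[v].add(u)
    clique = {start}
    for v in vertices:
        # Add v only if it is adjacent to every vertex already in the set.
        if v not in clique and clique <= adjacent[v]:
            clique.add(v)
    return clique


# Figure 15.1(c): edges a-b and a-c.
vertices = ["a", "b", "c"]
edges_c = [("a", "b"), ("a", "c")]
print(sorted(grow_clique(vertices, edges_c, "b")))   # clique containing b
print(sorted(grow_clique(vertices, edges_c, "c")))   # clique containing c
```

A vertex rejected at some stage is nonadjacent to a vertex already in the set, and the set only grows, so one pass through the vertices suffices to reach a maximal complete set.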

Lemma 15.2.1. If the distribution of $X_V$ is Markov, it is determined by the
set of marginal distributions of all cliques.

In Figure 15.1(a) each of a, b, c is a clique; in (b) each of (a, b) and c is a
clique; in (c) each of (a, b) and (a, c) is a clique; in (d) (a, b, c) is a clique.

Definition 15.2.9. The density $f(X_V)$ factorizes with respect to G if there
are nonnegative functions $g_C(X_C)$ depending on the complete subgraphs such that

(27)    $f(X_V) = \prod_{C\ \mathrm{complete}} g_C(X_C).$

Since it suffices to consider only cliques, an alternative factorization is

(28)    $f(X_V) = \prod_{C^*\ \mathrm{cliques}} g_{C^*}(X_{C^*}).$

These functions $g_C(X_C)$ and $g_{C^*}(X_{C^*})$ are not necessarily densities or
conditional densities. The problems of statistical inference may be reduced to
the problems of the complete subgraphs or cliques.

Definition 15.2.10. A decomposition of a graph is formed by three disjoint
sets A, B, S if $V = A \cup B \cup S$, S separates A and B, and S is complete.

In this definition one or more of the sets A, B, and S may be empty. If
both A and B are nonempty, the decomposition is termed proper.

Definition 15.2.11. A graph is decomposable if it is complete or if there is
a proper decomposition (A, B, S) into decomposable subgraphs $G_{A \cup S}$ and
$G_{B \cup S}$.

Theorem 15.2.9. Suppose A, B, S decomposes $G = (V, E)$. Then the density
of $X_V$ factorizes with respect to G if and only if its marginal densities $f_{A \cup S}(x_{A \cup S})$
and $f_{B \cup S}(x_{B \cup S})$ factorize and the densities satisfy

(29)    $f_V(x_V) = \dfrac{f_{A \cup S}(x_{A \cup S})\, f_{B \cup S}(x_{B \cup S})}{f_S(x_S)}.$

Proof. Suppose that $f_V(x_V)$ factorizes as

(30)    $f_V(x_V) = \prod_{C \in \mathscr{C}} g_C(x_C).$

Because A, B, S decomposes G, every clique is either a subset of $A \cup S$ or a
subset of $B \cup S$. Let $\mathscr{A}$ denote the cliques that are subsets of $A \cup S$, and $\mathscr{B}$
those that are subsets of $B \cup S$. Then $f_V(x_V) = h(x_{A \cup S})\, k(x_{B \cup S})$, where

(31)    $h(x_{A \cup S}) = \prod_{C \in \mathscr{A}} g_C(x_C),$

(32)    $k(x_{B \cup S}) = \prod_{C \in \mathscr{B}} g_C(x_C).$

Integration of (30) with respect to $x_B$ gives

(33)    $f_{A \cup S}(x_{A \cup S}) = h(x_{A \cup S})\, \bar{k}(x_S),$

where

(34)    $\bar{k}(x_S) = \int k(x_{B \cup S})\, dx_B.$

In turn $f_{A \cup S}(x_{A \cup S})$ and $f_{B \cup S}(x_{B \cup S})$ can be factorized, leading to (28).

Figure 15.2. [Directed graph with vertices difficulty (1), IQ (2), grade (3), recommendation (4), and SAT (5).]

15.3. DIRECTED GRAPHS

We now include relations with a direction; the measurement represented by
one vertex u may precede the measurement represented by another vertex v.
In the graph this directed edge is displayed as an arrow pointing from u to v;
in notation it appears as (u, v), which is now distinguished from (v, u). The
precedence may indicate the times of measurement, for example, the precipitation
on two successive days, or may indicate possible causation.

The difficulty of an examination $x_1$ may affect the grade of a student $x_3$;
the grade is also affected by his/her IQ $x_2$. In turn the grade of the student
influences the quality of a letter of recommendation $x_4$; the IQ is a factor in
performance on the SAT, $x_5$. See Figure 15.2. (We shall draw figures so that
the action proceeds from left to right.)

A graph composed entirely of directed edges is called a directed graph. A
cycle, such as 1 → 2, 2 → 3, 3 → 1, is hard to interpret and hence is usually
ruled out. A directed graph without a cycle is an acyclic directed graph (ADG
or DAG), also known as an acyclic digraph. All directed graphs in this
chapter are acyclic.

An acyclic directed graph may represent a recursive linear system. For
example, Figure 15.2 could represent

(1)    $x_1 = u_1,$

(2)    $x_2 = u_2,$

(3)    $x_3 = \beta_{31}x_1 + \beta_{32}x_2 + u_3,$

(4)    $x_4 = \beta_{43}x_3 + u_4,$

(5)    $x_5 = \beta_{52}x_2 + u_5,$

where $u_1, u_2, u_3, u_4, u_5$ are mutually independent unobserved variables.
Wold (1960) called such models causal chains. Note that the matrix of
coefficients is lower triangular. In general $X_i$ may depend on $X_1, \ldots, X_{i-1}$.

The recursive linear system (1) to (5) generates the recursive factorization

(6)    $f_{12345}(x_1, x_2, x_3, x_4, x_5) = f_1(x_1)\, f_2(x_2)\, f_{3|12}(x_3 \mid x_1, x_2)\, f_{4|123}(x_4 \mid x_3)\, f_{5|1234}(x_5 \mid x_2).$

A directed graph induces a partial order.

Definition 15.3.1. A partial ordering of an acyclic directed graph $u \le v$ is
defined by the existence of a directed path

(7)    $u = v_0 \to v_1 \to \cdots \to v_n = v.$

The partial ordering satisfies the conditions (i) reflexive: $u \le u$; (ii) transitive:
$u \le v$ and $v \le w$ imply $u \le w$; and (iii) antisymmetric: $u \le v$ and $v \le u$ imply
$u = v$. Further, $u \le v$ and $u \ne v$ defines $u < v$.

Definition 15.3.2. If $u \to v$, then u is a parent of v, termed u = pa(v), and
v is a child of u, termed v = ch(u). In symbols

(8)    $\mathrm{pa}(v) = \{w \in V \setminus v \mid w \to v\},$

(9)    $\mathrm{ch}(u) = \{w \in V \setminus u \mid u \to w\}.$

In the graph displayed in Figure 15.2 we have (1, 2) = pa(3), 3 = pa(4),
2 = pa(5), 3 = ch(1, 2), 4 = ch(3), and 5 = ch(2).

Definition 15.3.3. If $u < v$, then v is a descendant of u,

(10)    $\mathrm{de}(u) = \{v \mid u < v\},$

and u is an ancestor of v,

(11)    $\mathrm{an}(v) = \{u \mid u < v\}.$

The set of nondescendants of u is $\mathrm{Nd}(u) = V \setminus \mathrm{de}(u)$, and the set of strict
nondescendants is $\mathrm{nd}(u) = \mathrm{Nd}(u) \setminus u$. Define $\mathrm{An}(A) = \mathrm{an}(A) \cup A$.




[Figure 15.3: diagram with labels pa(v) and w.]


Note that

(12)    $\mathrm{pa}(v) \subset \mathrm{an}(v) \subset \mathrm{nd}(v).$
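
The sets of Definitions 15.3.2 and 15.3.3 can be computed directly from the list of directed edges. The following minimal sketch uses the graph of Figure 15.2 as described in the text (1 → 3, 2 → 3, 3 → 4, 2 → 5); the function names mirror the notation of the definitions but are otherwise illustrative.

```python
# Directed edges of Figure 15.2: 1 -> 3, 2 -> 3, 3 -> 4, 2 -> 5.
EDGES = [(1, 3), (2, 3), (3, 4), (2, 5)]
VERTICES = {1, 2, 3, 4, 5}


def pa(v):
    """Parents of v."""
    return {u for u, w in EDGES if w == v}


def ch(u):
    """Children of u."""
    return {w for x, w in EDGES if x == u}


def de(u):
    """Descendants of u: vertices reached by a directed path of length >= 1."""
    found, frontier = set(), ch(u)
    while frontier:
        found |= frontier
        frontier = {w for x in frontier for w in ch(x)} - found
    return found


def an(v):
    """Ancestors of v."""
    return {u for u in VERTICES if v in de(u)}


def nd(u):
    """Strict nondescendants of u: V without de(u) and without u itself."""
    return VERTICES - de(u) - {u}


for v in sorted(VERTICES):
    print(v, sorted(pa(v)), sorted(an(v)), sorted(nd(v)))
```

For every vertex of this graph the printed sets satisfy the inclusions in (12).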

In our study of undirected graphs we considered three Markov properties
independently defined and then showed that a graph with one Markov
property also has the other two. In the case of acyclic directed graphs we
shall define three similar Markov properties, but the definitions are different
because they take account of the direction of action or influence.

Definition 15.3.4. A distribution on an acyclic directed graph G is pairwise
Markov if for every $v \in V$ and $w \in \mathrm{nd}(v) \setminus \mathrm{pa}(v)$

(13)    $v \perp\!\!\!\perp w \mid \mathrm{nd}(v) \setminus w.$

In comparison with Definition 15.2.1 for undirected graphs, note that
attention is paid only to vertices in nd(v); since pa(v) is the effective
boundary of v, the vertices w and v are nonadjacent. (See Figure 15.3.) Note
also that the conditioning set includes the parents of v, but not the children
(which are descendants).

Definition 15.3.5. A distribution on an acyclic directed graph is locally
Markov if

(14)    $v \perp\!\!\!\perp [\mathrm{nd}(v) \setminus \mathrm{pa}(v)] \mid \mathrm{pa}(v).$

In the definition of locally Markov the conditioning is only on the parents
of v, but in the definition of pairwise Markov the conditioning is on all of the
other nondescendants. These features correspond to Definitions 15.2.1 and
15.2.3 for undirected graphs.

In Figure 15.2, we have $1 \perp\!\!\!\perp 2, 5$; $3 \perp\!\!\!\perp 5 \mid 2$; $4 \perp\!\!\!\perp 1, 2, 5 \mid 3$; and $5 \perp\!\!\!\perp 1, 3, 4 \mid 2$.
In an undirected graph constructed by replacing arrows in Figure 15.2 by
lines (directed edges by undirected edges), a locally Markov distribution on
the graph would include the conditional independences $1 \perp\!\!\!\perp 2 \mid 3$; $1, 2 \perp\!\!\!\perp 4 \mid 3$;
$1, 3, 4 \perp\!\!\!\perp 5 \mid 2$. In the interpretation of the arrow indicating time sequence $X_3$
relates to the future of $(X_1, X_2)$; the future cannot be conditioned on.

As another example, consider an autoregressive time series $y_0, y_1, \ldots, y_T$
defined by

(15)    $y_t = \rho y_{t-1} + u_t, \qquad t = 1, \ldots, T,$

where $u_1, \ldots, u_T$ are independent $N(0, \sigma^2)$ variables and $y_0$ has distribution
$N[0, \sigma^2/(1 - \rho^2)]$. In this case given $y_t$, the future $y_{t+1}, \ldots, y_T$ is independent
of the past $y_0, \ldots, y_{t-1}$.
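
The Markov property of this series can be illustrated numerically. The sketch below relies on the standard fact that the stationary process (15) has covariances $\sigma^2\rho^{|s-t|}/(1 - \rho^2)$; the values of $\rho$ and $\sigma^2$ and the function names are hypothetical choices for illustration.

```python
from fractions import Fraction

rho = Fraction(1, 2)       # hypothetical autoregression coefficient
sigma2 = Fraction(1)       # hypothetical innovation variance


def cov(s, t):
    """Covariance of y_s and y_t for the stationary process (15)."""
    return sigma2 * rho ** abs(s - t) / (1 - rho ** 2)


def partial_cov(s, t, given):
    """Covariance of y_s and y_t conditional on the single variable y_given."""
    return cov(s, t) - cov(s, given) * cov(given, t) / cov(given, given)


# The past y_1 and the future y_3 are correlated marginally ...
print(cov(1, 3))
# ... but uncorrelated (hence independent, by normality) given the present y_2.
print(partial_cov(1, 3, 2))
```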

Theorem 15.3.1. A locally Markov distribution on an acyclic directed graph
is pairwise Markov.

Proof. The proof is the same as the proof of Theorem 15.2.3 for undirected
graphs. ■

Theorem 15.3.2. A pairwise Markov distribution on an acyclic directed
graph is locally Markov.

Proof. The proof is the same as the proof of Theorem 15.2.4. ■

Another Markov property is based on numbering the vertices in an order 
reflecting the direction of the action or the partial ordering induced. 

Definition 15.3.6. An enumeration of the elements of V is called well-numbered
if $i < j \Rightarrow v_j \not< v_i$, or equivalently $v_j < v_i \Rightarrow j < i$.

Theorem 15.3.3. A finite ordered set $(V, \le)$ admits at least one well-numbering.

Definition 15.3.7. An element $a^*$ is maximal (or terminal) if $a^* \le b \Rightarrow a^* = b$.

Lemma 15.3.1. A finite, partially ordered set $(V, \le)$ has at least one
maximal element $a^*$.

Proof of Lemma. The proof is by induction with $a^* = a$ if #(V) = 1.
Assume the lemma holds for #(V) = n, and consider #(V) = n + 1. Then
$V = a \cup (V \setminus a)$ for any $a \in V$. Since $\#(V \setminus a) = n$, $V \setminus a$ has a maximal element,
say $\bar{a}$. Then either $\bar{a} \le a$ and so a is maximal, or $\bar{a} \not\le a$ and so $\bar{a}$ is
maximal. ■

Proof of Theorem 15.3.3. We shall construct a well-numbering. Let $v^*$ be a
maximal element; define $v_n = v^*$. In $V \setminus v_n$ let $v^{**}$ be a maximal element;
define $v_{n-1} = v^{**}$. At the jth stage let $v^{***}$ be a maximal element in
$V \setminus (v_n, \ldots, v_{n-j+2})$; define $v_{n-j+1} = v^{***}$, $j = 3, \ldots, n - 1$. Then $v_1 = V \setminus (v_n, \ldots, v_2)$.
This construction satisfies Definition 15.3.6. ■

The well-numbering of V as $v_{(1)}, \ldots, v_{(n)}$ implies that in any directed path
$u = v_{(i_0)} \to v_{(i_1)} \to \cdots \to v_{(i_n)} = v$ the indices satisfy $i_0 < i_1 < \cdots < i_n$. The
well-numbering is not necessarily unique. Since V is finite, a maximal
element can be found by comparing $v_i$ and $v_j$ for at most n(n - 1)/2 pairs.
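
The construction in the proof of Theorem 15.3.3 amounts to repeatedly removing a maximal element and giving it the largest unused number. A minimal sketch follows, assuming the graph is acyclic and is supplied as a list of directed edges; the function name `well_numbering` is illustrative.

```python
def well_numbering(vertices, edges):
    """Order vertices so that every directed edge u -> v has u before v.

    Repeatedly removes a maximal (terminal) element of the remaining
    vertices and assigns it the largest number not yet used.
    """
    remaining = list(vertices)
    reversed_order = []
    while remaining:
        # A maximal element has no child among the remaining vertices.
        maximal = next(v for v in remaining
                       if not any(u == v and w in remaining for u, w in edges))
        reversed_order.append(maximal)
        remaining.remove(maximal)
    return reversed_order[::-1]


# Figure 15.2: 1 -> 3, 2 -> 3, 3 -> 4, 2 -> 5.
fig_edges = [(1, 3), (2, 3), (3, 4), (2, 5)]
order = well_numbering([1, 2, 3, 4, 5], fig_edges)
print(order)
```

The numbering returned depends on the order in which candidates are examined, which reflects the remark that a well-numbering is not necessarily unique.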

Definition 15.3.8. Let $(v_1, \ldots, v_n)$ be a well-numbering of the acyclic directed
graph G. A distribution on G is well-numbered Markov with respect to
this well-numbering if

(16)    $v_i \perp\!\!\!\perp (v_1, \ldots, v_{i-1}) \setminus \mathrm{pa}(v_i) \mid \mathrm{pa}(v_i), \qquad i = 3, \ldots, n.$

Apparently the definition depends on the choice of well-numbering, but
this is not the case, by Theorem 15.3.4.

Theorem 15.3.4. A distribution on an acyclic directed graph that is well-numbered
Markov is locally Markov.

Proof. nd(u,)\pa(^X ■ 

The definition of the global Markov property depends on relating the 
directed graph to a corresponding undirected graph. 

Definition 15.3.9. The moral graph $G^m$ of an acyclic directed graph $G = (V, E)$
is the undirected graph constructed by adding (undirected) edges between
parents of each vertex $v \in V$ and replacing every directed edge by an undirected
edge.

In the jargon of graph theory, the parents of a vertex are “married.”
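
Moralization is easily carried out mechanically. The following minimal sketch applies Definition 15.3.9 to the graph of Figure 15.2; the function name `moral_graph` and the representation of undirected edges as two-element frozensets are assumptions of the sketch.

```python
from itertools import combinations


def moral_graph(vertices, edges):
    """Return the undirected edge set of the moral graph of a DAG."""
    undirected = {frozenset(e) for e in edges}       # drop the directions
    for v in vertices:
        parents = sorted(u for u, w in edges if w == v)
        # "Marry" every pair of parents of v.
        undirected |= {frozenset(pair) for pair in combinations(parents, 2)}
    return undirected


# Figure 15.2: 1 -> 3, 2 -> 3, 3 -> 4, 2 -> 5.
moral = moral_graph([1, 2, 3, 4, 5], [(1, 3), (2, 3), (3, 4), (2, 5)])
print(sorted(tuple(sorted(e)) for e in moral))
```

In this example the only added edge joins vertices 1 and 2, the two parents of vertex 3.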

Definition 15.3.10. A distribution on an acyclic directed graph is globally
Markov if $A \perp\!\!\!\perp B \mid S$ for every A, B, and S such that S separates A and B in
$[G_{\mathrm{An}(A \cup B \cup S)}]^m$.

Theorem 15.3.5. A distribution on an acyclic directed graph that is globally
Markov is locally Markov.

Proof. For any $v \in V$ let pa(v) = S in the definition of globally Markov.
Let v = A and $\mathrm{nd}(v) \setminus \mathrm{pa}(v) = B$. A vertex $w \in \mathrm{nd}(v) \setminus \mathrm{pa}(v)$ is a vertex in
$\mathrm{An}(A \cup B \cup S)$. Let $\pi = (w = v_0, v_1, \ldots, v_n = v)$ be a path from w to v in
$[G_{\mathrm{An}(A \cup B \cup S)}]^m = [G_{\mathrm{Nd}(v)}]^m$. If $(v_{n-1}, v_n)$ corresponds to a directed edge
$(v_{n-1} \to v_n)$ in G, then $v_{n-1} \in \mathrm{pa}(v) = S$ and pa(v) separates
$\mathrm{nd}(v) \setminus \mathrm{pa}(v)$ and v. [The directed edge $(v_n \to v_{n-1})$ implies $v_{n-1} \in \mathrm{de}(v)$.]
■

Theorem 15.3.6. A distribution on an acyclic directed graph that is locally
Markov is globally Markov.


The proof is very lengthy and is omitted. 

Recursive Factorization 

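The recursive aspect of the acyclic directed graph permits a systematic
factorization of the density. Use the construction of Theorem 15.3.3. Let
n = |V|; then $v_n$ is a maximal element of V. Then

(17)    $X_{V \setminus \mathrm{cl}(v_n)} \perp\!\!\!\perp X_{v_n} \mid \mathrm{pa}(v_n).$

Thus (in the normal case)

(18)    $\mathscr{E}[X_{v_n} \mid \mathrm{pa}(v_n)] = \alpha_n + B_n X_{\mathrm{pa}(v_n)},$

(19)    $\mathscr{C}[X_{v_n} \mid \mathrm{pa}(v_n)] = \Sigma_n.$

At the jth step let $v_{n-j+1}$ be a maximal element of $V \setminus (v_n, \ldots, v_{n-j+2})$. Then

(20)    $X_{[V \setminus (v_n, \ldots, v_{n-j+2})] \setminus \mathrm{cl}(v_{n-j+1})} \perp\!\!\!\perp X_{v_{n-j+1}} \mid \mathrm{pa}(v_{n-j+1}).$

Thus

(21)    $\mathscr{E}[X_{v_{n-j+1}} \mid \mathrm{pa}(v_{n-j+1})] = \alpha_{n-j+1} + B_{n-j+1} X_{\mathrm{pa}(v_{n-j+1})},$

(22)    $\mathscr{C}[X_{v_{n-j+1}} \mid \mathrm{pa}(v_{n-j+1})] = \Sigma_{n-j+1}.$

The vector $\varepsilon_{n-j+1} = X_{v_{n-j+1}} - \alpha_{n-j+1} - B_{n-j+1} X_{\mathrm{pa}(v_{n-j+1})}$ is independent of
$\mathrm{pa}(v_{n-j+1})$. The relations (18) to (22) can be written as generating equations. Let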


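(23)    $X_1 = \alpha_1 + \varepsilon_1,$

(24)    $X_2 = \alpha_2 + B_2 x_1 + \varepsilon_2,$

        $\vdots$

(25)    $X_{n-1} = \alpha_{n-1} + B_{n-1}(x_1', \ldots, x_{n-2}')' + \varepsilon_{n-1},$

(26)    $X_n = \alpha_n + B_n(x_1', \ldots, x_{n-1}')' + \varepsilon_n,$

where $\varepsilon_1, \ldots, \varepsilon_n$ are independent random vectors with $\mathscr{E}\varepsilon_j\varepsilon_j' = \Sigma_j$. In matrix
form (23) to (26) are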

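(27)    $Bx = \alpha + \varepsilon,$

where

(28)    $\alpha = \begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \alpha_3 \\ \vdots \\ \alpha_n \end{pmatrix}, \qquad B = \begin{pmatrix} I & 0 & 0 & \cdots & 0 \\ -B_{21} & I & 0 & \cdots & 0 \\ -B_{31} & -B_{32} & I & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ -B_{n1} & -B_{n2} & -B_{n3} & \cdots & I \end{pmatrix}, \qquad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \vdots \\ \varepsilon_n \end{pmatrix},$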

and B }J = 0 if / </ - k } . Because the determinant of B is 1, (27) can be 
solved for 

(29)    $x = B^{-1}\alpha + B^{-1}\varepsilon.$

The matrix $B^{-1}$ is also lower triangular.


15.4. CHAIN GRAPHS 


A chain graph includes both directed and undirected edges; however, only
certain patterns of vertices and edges are permitted. Suppose the set of
vertices V of the graph $G = (V, E)$ can be partitioned into subsets
$V = V(1) \cup \cdots \cup V(T)$ so that within a subset the vertices are joined by undirected
edges and directed edges join vertices in different subsets. Let $\mathscr{V}(G)$
be the set of vertices $1, \ldots, T$, and let $\mathscr{E}(G)$ be the (directed) edge set such
that $\tau \to \sigma$ if and only if there is at least one element $u \in V(\tau)$ and at least
one element $v \in V(\sigma)$ such that $u \to v$ is in E, the edge set of G. Then
$\mathscr{D}(G) = [\mathscr{V}(G), \mathscr{E}(G)]$ is an acyclic directed graph; we can define $\mathrm{pa}_{\mathscr{D}}(\tau)$,
etc., for $\mathscr{D}(G)$.

Let $X_\tau = \{X_u \mid u \in V(\tau)\}$. Within a set the vertices form an undirected
graph relative to the probability distribution conditional on the past (that is,
earlier sets). See Figure 15.4 [Lauritzen (1996)] and Figure 15.5.

We now define the Markov properties as specified by Lauritzen and
Wermuth (1989) and Frydenberg (1990):

(C1) The distribution of $X_\tau$, $\tau = 1, \ldots, T$, is locally Markov with respect to
the acyclic directed graph $\mathscr{D}(G)$; that is,

(1)    $X_\tau \perp\!\!\!\perp X_\sigma \mid X_{\mathrm{pa}_{\mathscr{D}}(\tau)}, \qquad \sigma \in \mathrm{nd}_{\mathscr{D}}(\tau) \setminus \mathrm{pa}_{\mathscr{D}}(\tau).$

Figure 15.4. A chain graph.

Figure 15.5. The corresponding induced acyclic directed graph on $\mathscr{V} = V(1) \cup V(2) \cup V(3)$.

(C2) For each $\tau$ the conditional distribution of $X_\tau$ given $X_{\mathrm{pa}_{\mathscr{D}}(\tau)}$ is globally
Markov with respect to the undirected graph on $V(\tau)$.

(C3)

(2)    $X_U \perp\!\!\!\perp X_v \mid X_{\mathrm{bd}_G(U)}, \qquad U \subset V(\tau), \quad v \in \mathrm{pa}_{\mathscr{D}}(\tau) \setminus \mathrm{pa}_G(U).$

Here $\mathrm{bd}_G(U) = \mathrm{pa}_G(U) \cup \mathrm{nb}_G(U)$. A distribution on the chain graph G that
satisfies (C1), (C2), (C3) is LWF block recursive Markov.

In Figure 15.6 $\mathrm{pa}_{\mathscr{D}}(\tau) = \{\tau - 1, \tau - 2\}$ and $\mathrm{nd}_{\mathscr{D}}(\tau) \setminus \mathrm{pa}_{\mathscr{D}}(\tau) = \{\tau - 3, \tau - 4, \ldots\}$.
The set U = (u, w) is a set in $V(\tau)$, and $\mathrm{pa}_G(U)$ is the set in
$V(\tau - 1) \cup V(\tau - 2)$ that includes $\mathrm{pa}_G(u)$ for $u \in U$; that is, $\mathrm{pa}_G(U) = (x, y)$.

Figure 15.6. A chain graph. [Block labels $V(\tau - 1)$ and $V(\tau)$.]

Figure 15.7. A chain graph. [Vertices 1 and 2 in V(1); vertices 3 and 4 in V(2).]


Andersson, Madigan, and Perlman (2001) have proposed an alternative
Markov property (AMP), replacing (C3) by

(C3*)

(3)    $X_U \perp\!\!\!\perp X_v \mid X_{\mathrm{pa}_G(U)}, \qquad U \subset V(\tau), \quad v \in \mathrm{pa}_{\mathscr{D}}(\tau) \setminus \mathrm{pa}_G(U).$

In Figure 15.6, $X_v$ for a vertex v in $V(\tau - 2) \cup V(\tau - 1)$ is conditionally
independent of $X_U$ [$U \subset V(\tau)$] when regressed on $X_{\mathrm{pa}_G(U)} = (X_x, X_y)$.
The difference between (C3) and (C3*) is that the conditioning in (C3) is on
$\mathrm{bd}_G(U) = \mathrm{pa}_G(U) \cup \mathrm{nb}_G(U)$, but the conditioning in (C3*) is on $\mathrm{pa}_G(U)$
only. See Figure 15.6. The conditioning in (C3*) is on variables in the past.
Figure 15.7 [Andersson, Madigan, and Perlman (2001)] illustrates the difference
between the LWF and AMP Markov properties:

(4)    LWF: $X_1 \perp\!\!\!\perp X_4 \mid X_2, X_3, \qquad X_2 \perp\!\!\!\perp X_3 \mid X_1, X_4,$

(5)    AMP: $X_1 \perp\!\!\!\perp X_4 \mid X_2, \qquad X_2 \perp\!\!\!\perp X_3 \mid X_1.$

Note that in (5) $X_1$ and $X_4$ are conditionally independent given $X_2$; the
conditional distribution of $X_4$ depends on $\mathrm{pa}(v_4)$, but not $X_3$.

The AMP specification allows a block recursive equation formulation.
In the example in Figure 15.7 the distribution of scalars $X_1$ and $X_2$
[$v_1, v_2 \in V(1)$] can be specified as

(6)    $X_1 = \varepsilon_1,$

(7)    $X_2 = \varepsilon_2,$

where $(\varepsilon_1, \varepsilon_2)$ has an arbitrary (normal) distribution. Since $X_3$ depends
directly on $X_1$ and $X_4$ depends directly on $X_2$, we write

(8)    $X_3 = \beta_{31}X_1 + \varepsilon_3,$

(9)    $X_4 = \beta_{42}X_2 + \varepsilon_4,$

where $(\varepsilon_3, \varepsilon_4)$ has an arbitrary distribution independent of $(\varepsilon_1, \varepsilon_2)$, and
hence independent of $(X_1, X_2)$.

In general the AMP model can be expressed as (26) of Section 15.3.
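
The AMP independences for Figure 15.7 can be verified from the block recursive equations. The sketch below assumes the form $X_1 = \varepsilon_1$, $X_2 = \varepsilon_2$, $X_3 = \beta_{31}X_1 + \varepsilon_3$, $X_4 = \beta_{42}X_2 + \varepsilon_4$ with unit error variances; all numerical values are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

# Hypothetical parameter values for the block recursive equations.
a = Fraction(1, 2)        # covariance of e1 and e2 (unit variances)
c = Fraction(1, 3)        # covariance of e3 and e4 (unit variances)
b31, b42 = Fraction(2), Fraction(3)

# Each X_i written as a row of coefficients on (e1, e2, e3, e4).
coef = [[1, 0, 0, 0],
        [0, 1, 0, 0],
        [b31, 0, 1, 0],
        [0, b42, 0, 1]]
cov_e = [[1, a, 0, 0],
         [a, 1, 0, 0],
         [0, 0, 1, c],
         [0, 0, c, 1]]

# Covariance matrix of X: coef * cov_e * coef'.
sigma = [[sum(coef[i][k] * cov_e[k][l] * coef[j][l]
              for k in range(4) for l in range(4))
          for j in range(4)] for i in range(4)]


def partial_cov(i, j, g):
    """Covariance of X_i and X_j given the single variable X_g (1-based)."""
    i, j, g = i - 1, j - 1, g - 1
    return sigma[i][j] - sigma[i][g] * sigma[g][j] / sigma[g][g]


print(partial_cov(1, 4, 2))   # AMP: X1 and X4 given X2
print(partial_cov(2, 3, 1))   # AMP: X2 and X3 given X1
print(sigma[0][3])            # marginal covariance of X1 and X4
```

Both partial covariances vanish for any choice of the parameters, while the marginal covariance of $X_1$ and $X_4$ is nonzero whenever $\varepsilon_1$ and $\varepsilon_2$ are correlated and $\beta_{42} \ne 0$.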


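15.5. STATISTICAL INFERENCE

15.5.1. Normal Distribution

Let $x_1, \ldots, x_N$ be N observations on X with distribution $N(\mu, \Sigma)$. Let
$\bar{x} = N^{-1}\sum_{\alpha=1}^{N} x_\alpha$ and $S = (N-1)^{-1}\sum_{\alpha=1}^{N}(x_\alpha - \bar{x})(x_\alpha - \bar{x})' = (N-1)^{-1}\left[\sum_{\alpha=1}^{N} x_\alpha x_\alpha' - N\bar{x}\bar{x}'\right]$.
The likelihood is

(1)    $(2\pi)^{-pN/2}|\Sigma|^{-N/2}\exp\left[-\tfrac{1}{2}\sum_{\alpha=1}^{N}(x_\alpha - \mu)'\Sigma^{-1}(x_\alpha - \mu)\right]$
       $= (2\pi)^{-pN/2}|\Sigma|^{-N/2}\exp\left\{-\tfrac{1}{2}\left[N(\bar{x} - \mu)'\Sigma^{-1}(\bar{x} - \mu) + (N-1)\,\mathrm{tr}\,\Sigma^{-1}S\right]\right\}.$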

The above form shows that $\bar{x}$ and S are a pair of sufficient statistics for $\mu$
and $\Sigma$, and they are independently distributed. The interest in this chapter is
on the dependences, which depend only on the covariance matrix $\Sigma$, not $\mu$.
For the rest of this chapter we shall suppress the mean. Accordingly, we
suppose that the parent distribution is $N(0, \Sigma)$ and the sample is $x_1, \ldots, x_n$,
and $S = (1/n)\sum_{\alpha=1}^{n} x_\alpha x_\alpha'$. The likelihood function can be written

(2)    $(2\pi)^{-pn/2}|\Lambda|^{n/2}e^{-\frac{1}{2}n\,\mathrm{tr}\,\Lambda S} = \exp\left[-\psi(\Lambda) - \tfrac{1}{2}\sum_{i=1}^{p}\lambda_{ii}t_{ii} - \sum_{i<j}\lambda_{ij}t_{ij}\right],$

where $\Lambda = (\lambda_{ij}) = \Sigma^{-1}$, $T = (t_{ij}) = \sum_{\alpha=1}^{n} x_\alpha x_\alpha'$, and $\psi(\Lambda) = \tfrac{1}{2}pn\log(2\pi) - \tfrac{1}{2}n\log|\Lambda|$.

The likelihood is in the exponential family with canonical parameter $\Lambda$
and statistic T. The maximum likelihood estimator of $\Sigma$ with no restriction is
$\hat{\Sigma} = S = (1/n)T$. Since $\Lambda = \Sigma^{-1}$ is a 1-to-1 transformation of $\Sigma$, the maximum
likelihood estimator of $\Lambda$ is $\hat{\Lambda} = \hat{\Sigma}^{-1}$.

15.5.2. Covariance Selection Models

In undirected graphs many of the models involve zero restrictions on elements
of $\Lambda$. Dempster (1972) studied such models and introduced the term
covariance selection. When the (undirected) graph satisfies the pairwise Markov
condition, $\lambda_{ij} = 0$ for $(i, j) \notin E$. We assume here that the graph satisfies this
Markov condition. Further we assume $n \ge p$; then S is positive definite with
probability 1.

The likelihood function is

(3)    $(2\pi)^{-pn/2}|\Lambda|^{n/2}\exp\left\{-\tfrac{1}{2}n\left[\sum_{i=1}^{p}\lambda_{ii}s_{ii} + \sum_{i \ne j,\ (i,j) \in E}\lambda_{ij}s_{ij}\right]\right\},$

where $\Lambda$ satisfies the condition $\lambda_{ij} = 0$, $(i, j) \notin E$. In this form the canonical
parameters are $\lambda_{11}, \ldots, \lambda_{pp}$ and $\lambda_{ij}$, $(i, j) \in E$. The canonical variables are
$s_{11}, \ldots, s_{pp}$ and $s_{ij}$, $(i, j) \in E$; these form a sufficient set of statistics. To
maximize the likelihood function we differentiate (3) with respect to $\lambda_{ii}$,
$i = 1, \ldots, p$, and $\lambda_{ij}$, $(i, j) \in E$, to obtain the equations (4) and (5).

Theorem 15.5.1. The maximum likelihood estimator of $\Sigma$ in the model (3)
is given by

(4)    $\hat{\sigma}_{ij} = s_{ij}, \qquad i = j \text{ or } (i, j) \in E,$

(5)    $\hat{\lambda}_{ij} = 0, \qquad i \ne j \text{ and } (i, j) \notin E,$

where $\hat{\Lambda} = \hat{\Sigma}^{-1}$.

This result follows from the general theory of exponential families. See
Lauritzen (1996), Theorem 5.3 and Appendix D.1.

Here we shall show that for a decomposable graph the equations (4) and
(5) have a unique positive definite solution by developing an algorithm for its
computation. We follow Speed and Kiiveri (1986).

Theorem 15.5.2. Let L and M be $p \times p$ positive definite matrices. There
exists a unique positive definite matrix K such that

(6)    $k_{ij} = l_{ij}, \qquad i = j \text{ or } (i, j) \in E,$

(7)    $k^{ij} = m^{ij}, \qquad i \ne j \text{ and } (i, j) \notin E,$

where $(k^{ij}) = K^{-1}$ and $(m^{ij}) = M^{-1}$.

The proof of Theorem 15.5.2 depends on several lemmas. In the maximum
likelihood estimation $L = S$, $M = I$ or any other diagonal matrix, and $K = \hat{\Sigma}$.

To develop this subject we use the Kullback information. For a pair of
multivariate normal distributions $N(0, P)$ and $N(0, R)$ define

(8)    $I(P|R) = \mathscr{E}_P \log\dfrac{n(x \mid 0, P)}{n(x \mid 0, R)} = -\tfrac{1}{2}\left[\log|PR^{-1}| + \mathrm{tr}(I - PR^{-1})\right].$

Lemma 15.5.1. Suppose $P$ and $R$ are positive definite. Then:

(i) $I(P|R) > 0$, $P \ne R$, and $I(P|P) = 0$.

(ii) If $\{P_n\}$ and $\{R_n\}$ are sequences of positive definite matrices such that
$I(P_n|R_n) \to 0$, then $P_n R_n^{-1} \to I$.

Proof. (i) Let the roots of $|P - sR| = 0$ be $s_1 \le \cdots \le s_p$. Then

(9)  $\log|PR^{-1}| + \operatorname{tr}(I - PR^{-1}) = \sum_{i=1}^{p} (\log s_i + 1 - s_i) \le 0$,

and (9) is 0 if and only if $s_1 = \cdots = s_p = 1$.

(ii) Let the roots of $|P_n - sR_n| = 0$ be $s_1(n) \le \cdots \le s_p(n)$. Then
$I(P_n|R_n) \to 0$ implies $[s_1(n), \ldots, s_p(n)] \to (1, \ldots, 1)$, which implies that
$P_n R_n^{-1} \to I$.


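Lemma 15.5.2. Let

(10)  $P = \begin{pmatrix} P_{11} & P_{12} \\ P_{21} & P_{22} \end{pmatrix}$,  $R = \begin{pmatrix} R_{11} & R_{12} \\ R_{21} & R_{22} \end{pmatrix}$.

Then

(i) The matrix

(11)  $Q = \begin{pmatrix} P_{11} & P_{11}R_{11}^{-1}R_{12} \\ R_{21}R_{11}^{-1}P_{11} & R_{22} - R_{21}R_{11}^{-1}R_{12} + R_{21}R_{11}^{-1}P_{11}R_{11}^{-1}R_{12} \end{pmatrix}$

satisfies $Q_{11} = P_{11}$, $Q^{12} = R^{12}$, and $Q^{22} = R^{22}$, where

(12)  $Q^{-1} = \begin{pmatrix} P_{11}^{-1} + R^{12}(R^{22})^{-1}R^{21} & R^{12} \\ R^{21} & R^{22} \end{pmatrix} = \begin{pmatrix} P_{11}^{-1} - R_{11}^{-1} & 0 \\ 0 & 0 \end{pmatrix} + R^{-1}$.

(ii) $I(P|R) = I(P|Q) + I(Q|R)$.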
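Proof. (i) Let

(13)  $Q^{-1} = \begin{pmatrix} S & R^{12} \\ R^{21} & R^{22} \end{pmatrix}$.

Then $I = Q^{-1}Q$ can be solved for $S = P_{11}^{-1} + R^{12}(R^{22})^{-1}R^{21}$;
$Q = (Q^{-1})^{-1}$ follows from Theorem A.3.3. Then (ii) follows from

(14)  $PQ^{-1} = PR^{-1} + \begin{pmatrix} I - P_{11}R_{11}^{-1} & 0 \\ P_{21}(P_{11}^{-1} - R_{11}^{-1}) & 0 \end{pmatrix}$,  $QR^{-1} = \begin{pmatrix} P_{11}R_{11}^{-1} & 0 \\ R_{21}R_{11}^{-1}(P_{11}R_{11}^{-1} - I) & I \end{pmatrix}$,

and $|PQ^{-1}| \cdot |QR^{-1}| = |PR^{-1}|$. From (13) and (14)

(15)  $\operatorname{tr} PQ^{-1} + \operatorname{tr} QR^{-1} = \operatorname{tr} PR^{-1} + \operatorname{tr} I_p$.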

Lemma 15.5.2 provides the solution to the problem of finding a matrix $Q$,
given positive definite matrices $P$ and $R$, such that

(16)  $q_{ij} = p_{ij}$,  $i, j \in \{1, \ldots, t\}$,

(17)  $q^{ij} = r^{ij}$,  $(i, j) \notin \{1, \ldots, t\} \times \{1, \ldots, t\}$.
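The construction in Lemma 15.5.2 can be checked numerically. The following is a minimal sketch, assuming NumPy; the helper names `kullback`, `lemma_q`, and `random_pd` are hypothetical and introduced only for illustration. It builds $Q$ through the updating form (12) of $Q^{-1}$ and leaves the three properties (matching block, matching inverse entries, additivity of the Kullback information) to be verified on a random example.

```python
import numpy as np

def kullback(P, R):
    """Kullback information I(P|R) of (8) for N(0, P) relative to N(0, R)."""
    M = P @ np.linalg.inv(R)
    _, logdet = np.linalg.slogdet(M)
    return -0.5 * (logdet + np.trace(np.eye(P.shape[0]) - M))

def lemma_q(P, R, t):
    """Q of Lemma 15.5.2, built from the updating form (12) of Q^{-1}."""
    Q_inv = np.linalg.inv(R)
    # Only the leading t x t block of R^{-1} is changed.
    Q_inv[:t, :t] += np.linalg.inv(P[:t, :t]) - np.linalg.inv(R[:t, :t])
    return np.linalg.inv(Q_inv)

def random_pd(p, rng):
    """Hypothetical helper: a random positive definite p x p matrix."""
    A = rng.standard_normal((p, p + 3))
    return A @ A.T

rng = np.random.default_rng(0)
P, R, t = random_pd(4, rng), random_pd(4, rng), 2
Q = lemma_q(P, R, t)
```

Here `Q[:t, :t]` should agree with the corresponding block of `P`, the remaining entries of `np.linalg.inv(Q)` with those of `np.linalg.inv(R)`, and `kullback(P, R)` with `kullback(P, Q) + kullback(Q, R)`, up to rounding.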

We now develop an iterative method to find $K$ to satisfy (6) and (7), thus
proving Theorem 15.5.2. Suppose $E = c_1 \cup \cdots \cup c_m$, where $c_1, \ldots, c_m$ are the
$m$ cliques of a decomposable graph $G = (V, E)$. Let $K_0^{-1} = M^{-1}$. Define
recursively $K_n = (k_{ij}(n))$ such that

(18)  $k_{ij}(n) = l_{ij}$,  $(i,j) \in c_{n \bmod m}$,

(19)  $k^{ij}(n) = k^{ij}(n-1)$,  $(i,j) \notin c_{n \bmod m}$.

By Lemma 15.5.2, $K_n$ is uniquely determined. (The algorithm cycles through
the cliques.) By construction

(20)  $I(L|K_{n-1}) = I(L|K_n) + I(K_n|K_{n-1})$.

Summation of (20) from 1 to $q$ gives

(21)  $I(L|K_0) = I(L|K_q) + \sum_{n=1}^{q} I(K_n|K_{n-1})$.

Since $I(L|K_q) \ge 0$, $\sum_{n=1}^{q} I(K_n|K_{n-1})$ is bounded and $I(K_n|K_{n-1}) \to 0$ as $n \to \infty$.
The set $\{K^{-1} \mid I(L|K) \le I(L|K_0)\}$ is strictly convex.

Consider the vector sequence $(K_{rm+1}, \ldots, K_{rm+m})$ with index $r$ ($n = rm$).
It has a convergent subsequence; that is, $(K_{mr(i)+1}, \ldots, K_{mr(i)+m})$ converges
to $(K_1^*, \ldots, K_m^*)$, say. Since $I(K_n|K_{n-1}) \to 0$, $K_n K_{n-1}^{-1} \to I$. Then the
matrix $K_{mr(i)+j} K_{mr(i)+j-1}^{-1} \to I$, $j = 2, \ldots, m$, which implies
$K_1^* = \cdots = K_m^* = K$, say. Note that $\{(i,j) \mid (i,j) \notin E\}$ satisfies
$(i,j) \notin c_l$, $l = 1, \ldots, m$. Hence $K_n$ satisfies (7), and $K$ does too. Further,
$k_{ij}(mr(i) + t)$ satisfies (18) for $i, j \in c_t$, and $K$ does, too. Figure 15.8
diagrams the sets for $c_1 = \{(i,j) : i, j = 1, \ldots, t\}$ and
$c_2 = \{(i,j) : i, j = w, w+1, \ldots, p\}$, $w < t$.

Figure 15.8. Diagram of $c_1$ and $c_2$ for $c_1 \cup c_2 = E$.
The procedure allows for construction of a multivariate normal distribution
with arbitrary marginal distributions over the cliques $c_1, \ldots, c_m$, provided
that the specified marginal distributions are consistent.

Theorem 15.5.2 provides a proof of the existence and uniqueness of the 
maximum likelihood estimators. 
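The clique-cycling procedure of (18) and (19) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes NumPy, takes $L = S$ and $M = I$, and uses the updating form (12) restricted to one clique at a time. The function name `fit_covariance_selection`, the stopping rule, and the example graph (a path on four vertices) are hypothetical choices.

```python
import numpy as np

def fit_covariance_selection(S, cliques, n_cycles=100, tol=1e-12):
    """Cycle through the cliques, matching K to S on each clique block.

    Returns (K, K^{-1}); K plays the role of Sigma-hat and K^{-1} of Lambda-hat.
    """
    p = S.shape[0]
    K_inv = np.eye(p)                      # K_0^{-1} = M^{-1} with M = I
    for _ in range(n_cycles):
        K_inv_old = K_inv.copy()
        for c in cliques:
            idx = np.ix_(c, c)
            K = np.linalg.inv(K_inv)
            # Updating equation (12): only the clique block of K^{-1} changes.
            K_inv[idx] += np.linalg.inv(S[idx]) - np.linalg.inv(K[idx])
        if np.max(np.abs(K_inv - K_inv_old)) < tol:
            break
    return np.linalg.inv(K_inv), K_inv

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 4))
S = X.T @ X / 40
cliques = [[0, 1], [1, 2], [2, 3]]         # path graph 0 - 1 - 2 - 3
Sigma_hat, Lambda_hat = fit_covariance_selection(S, cliques)
```

Because the update never touches entries of $K^{-1}$ outside the current clique, the zeros required by (5) are preserved exactly, while the clique blocks of `Sigma_hat` are driven toward those of `S` as in (4).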

The equation (12) is an updating equation. When $Q^{-1} = K_n^{-1}$ and
$R^{-1} = K_{n-1}^{-1}$, the entries in $K_{n-1}^{-1}$ not in $c_{n \bmod m}$ remain unchanged.

Dempster (1972) also proposes some iterative methods for finding $K$
satisfying $k_{ij} = l_{ij}$, $(i,j) \in E$, and $k^{ij} = m^{ij}$, $(i,j) \notin E$. The entropy of
$n(x|0, P)$ is

(22)  $-\mathcal{E}_P \log n(x|0, P) = \tfrac{1}{2}(p \log 2\pi + p + \log|P|)$.

Note that $|\Sigma| = \prod_{i=1}^{p} \sigma_{ii}\,|R|$, where $R = (\rho_{ij})$. Given that $\hat\sigma_{ii} = s_{ii}$, the
selection of $\rho_{ij}$ to maximize the entropy of the fitted normal distribution
satisfying the requirements also maximizes $|R|$ [Dempster (1972)].
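15.5.3. Decomposition of Covariance Selection Models

An undirected graph is decomposable if the graph is formed by three disjoint
sets $A$, $B$, $C$, where $V = A \cup B \cup C$, $A$ and $B$ are nonempty, $C$ separates $A$
and $B$, and $C$ is complete. Then if $X_V$ is globally Markov with respect to $G$,
we have $X_A \perp\!\!\!\perp X_B \mid X_C$,

(23)  $\Sigma^{-1} = \Lambda = \begin{pmatrix} \Lambda_{AA} & 0 & \Lambda_{AC} \\ 0 & \Lambda_{BB} & \Lambda_{BC} \\ \Lambda_{CA} & \Lambda_{CB} & \Lambda_{CC} \end{pmatrix}$,

(24)  $\Sigma_{(AB)\cdot C} = \begin{pmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{pmatrix} - \begin{pmatrix} \Sigma_{AC} \\ \Sigma_{BC} \end{pmatrix} \Sigma_{CC}^{-1} (\Sigma_{CA}\ \ \Sigma_{CB})$,

and

(25)  $\Sigma_{AB\cdot C} = \Sigma_{AB} - \Sigma_{AC}\Sigma_{CC}^{-1}\Sigma_{CB} = 0$.

The maximum likelihood estimator of $\Sigma$ can be constructed from the
maximum likelihood estimators of $\Sigma_{(AB)\cdot C}$ and $\Sigma_{CC}$.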

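If there is no restriction on $\Sigma$, the maximum likelihood estimator of $\Sigma$ is

(26)  $\hat\Sigma_\Omega = \begin{pmatrix} S_{(AB)\cdot C} + S_{(AB)C} S_{CC}^{-1} S_{C(AB)} & S_{(AB)C} \\ S_{C(AB)} & S_{CC} \end{pmatrix} = S$,

where

(27)  $S_{(AB)\cdot C} = \begin{pmatrix} S_{AA\cdot C} & S_{AB\cdot C} \\ S_{BA\cdot C} & S_{BB\cdot C} \end{pmatrix}$,  $S_{(AB)C} = \begin{pmatrix} S_{AC} \\ S_{BC} \end{pmatrix}$.

If the restriction (25) is imposed, the maximum likelihood estimator is (26)
with $S_{AB\cdot C}$ replaced by 0 to obtain

(28)  $\hat\Sigma_\omega = \begin{pmatrix} \begin{pmatrix} S_{AA\cdot C} & 0 \\ 0 & S_{BB\cdot C} \end{pmatrix} + \begin{pmatrix} S_{AC} \\ S_{BC} \end{pmatrix} S_{CC}^{-1} (S_{CA}\ \ S_{CB}) & \begin{pmatrix} S_{AC} \\ S_{BC} \end{pmatrix} \\ (S_{CA}\ \ S_{CB}) & S_{CC} \end{pmatrix}$.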


The matrix $S_{(AB)\cdot C}$ has the Wishart distribution $W[\Sigma_{(AB)\cdot C}, n - (p_A + p_B)]$,
where $p_A$ and $p_B$ are the numbers of components of $X_A$ and $X_B$, respectively
(Section 8.3). The matrix $B_{(AB)\cdot C} = S_{(AB)C} S_{CC}^{-1}$ conditional on $(x_{C1}, \ldots, x_{Cn})$
$= X_C$ has a normal distribution, the covariance of which is given by


(29) 




and $S_{CC}$ has the Wishart distribution $W(\Sigma_{CC}, n)$. The matrix $S_{(AB)\cdot C}$ and the
matrix $B_{(AB)\cdot C}$ are independent (Chapter 8).

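Consider testing the null hypothesis (25). This is testing the null hypothesis
$\Sigma_{AB\cdot C} = 0$ against the alternative $\Sigma_{AB\cdot C} \ne 0$. The determinant of (26) is
$|\hat\Sigma_\Omega| = |S_{(AB)\cdot C}| \cdot |S_{CC}|$; the determinant of (28) is
$|\hat\Sigma_\omega| = |S_{AA\cdot C}| \cdot |S_{BB\cdot C}| \cdot |S_{CC}|$. The likelihood ratio criterion is

(30)  $\Bigl(\dfrac{|\hat\Sigma_\Omega|}{|\hat\Sigma_\omega|}\Bigr)^{n/2} = \Bigl(\dfrac{|S_{(AB)\cdot C}|}{|S_{AA\cdot C}| \cdot |S_{BB\cdot C}|}\Bigr)^{n/2}$.

Since the sample covariance matrix $S_{(AB)\cdot C}$ has the Wishart distribution
$W[\Sigma_{(AB)\cdot C}, n - (p_A + p_B)]$, where $p_A$ and $p_B$ are the numbers of components
of $X_A$ and $X_B$ (Section 8.2), the criterion is, in effect, $U_{p_A, p_B, n - (p_A + p_B)}$,
studied in Sections 8.4 and 8.5.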
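The restricted estimator and the determinant ratio above lend themselves to a short numerical illustration. The sketch below assumes NumPy; the function name `decomposable_mle`, the simulated data, and the index sets are hypothetical. It forms the partial covariance $S_{(AB)\cdot C}$, zeroes its off-diagonal $AB$ block as in (28), and returns the ratio $|S_{(AB)\cdot C}| / (|S_{AA\cdot C}| \cdot |S_{BB\cdot C}|)$.

```python
import numpy as np

def decomposable_mle(S, A, B, C):
    """Estimator of the form (28) and the determinant ratio U for X_A indep. X_B given X_C."""
    AB = A + B
    a = len(A)
    S_cc_inv = np.linalg.inv(S[np.ix_(C, C)])
    fitted = S[np.ix_(AB, C)] @ S_cc_inv @ S[np.ix_(C, AB)]
    # Partial covariance S_(AB).C = S_(AB)(AB) - S_(AB)C S_CC^{-1} S_C(AB)
    S_ab_c = S[np.ix_(AB, AB)] - fitted
    restricted = S_ab_c.copy()
    restricted[:a, a:] = 0.0               # S_AB.C replaced by 0
    restricted[a:, :a] = 0.0
    Sigma = S.copy()
    Sigma[np.ix_(AB, AB)] = restricted + fitted
    U = np.linalg.det(S_ab_c) / (
        np.linalg.det(S_ab_c[:a, :a]) * np.linalg.det(S_ab_c[a:, a:])
    )
    return Sigma, U

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 5))
S = X.T @ X / 50
A, B, C = [0, 1], [2], [3, 4]
Sigma_hat, U = decomposable_mle(S, A, B, C)
Lambda_hat = np.linalg.inv(Sigma_hat)
```

In this example the $AB$ block of `Lambda_hat` should vanish up to rounding, in agreement with (23), and `U` should equal the ratio of `np.linalg.det(S)` to `np.linalg.det(Sigma_hat)`.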

As another example, consider the graph in Figure 15.9. Note that node 4
separates (1, 2, 3) and (5, 6); nodes 1, 4 separate 2 and 3; and node 4 separates
5 and 6. These separations imply three conditional independences:
$(X_1, X_2, X_3) \perp\!\!\!\perp (X_5, X_6) \mid X_4$, $X_2 \perp\!\!\!\perp X_3 \mid (X_1, X_4)$, and $X_5 \perp\!\!\!\perp X_6 \mid X_4$. In terms of
covariances these conditional independences are

(31)  $\Sigma_{(123)(56)\cdot 4} = \Sigma_{(123)(56)} - \Sigma_{(123)4}\Sigma_{44}^{-1}\Sigma_{4(56)} = 0$,

(32)  $\Sigma_{23\cdot 14} = \Sigma_{23} - \Sigma_{2(14)}\Sigma_{(14)(14)}^{-1}\Sigma_{(14)3} = 0$,

(33)  $\Sigma_{56\cdot 4} = \Sigma_{56} - \Sigma_{54}\Sigma_{44}^{-1}\Sigma_{46} = 0$.

Figure 15.9

In view of (31) the restriction (32) can be written as

(34)  $\Sigma_{23\cdot 14} = \Sigma_{23\cdot 4} - \Sigma_{21\cdot 4}\Sigma_{11\cdot 4}^{-1}\Sigma_{13\cdot 4} = 0$.

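It will be convenient to reorder the subvectors as $X_2, X_3, X_1, X_5, X_6, X_4$ to
write

(35)  $S = \begin{pmatrix} S_{(2\cdots6)(2\cdots6)} & S_{(2\cdots6)4} \\ S_{4(2\cdots6)} & S_{44} \end{pmatrix} = \begin{pmatrix} S_{(2\cdots6)(2\cdots6)\cdot4} + S_{(2\cdots6)4}S_{44}^{-1}S_{4(2\cdots6)} & S_{(2\cdots6)4} \\ S_{4(2\cdots6)} & S_{44} \end{pmatrix}$,

where $S_{(2\cdots6)(2\cdots6)\cdot4}$ has blocks $S_{ij\cdot4}$, $i, j = 2, 3, 1, 5, 6$, and
$S_{(2\cdots6)4} = (S_{24}', \ldots, S_{64}')'$.

The determinant of $S$ is

(36)  $|S| = |S_{(2\cdots6)(2\cdots6)\cdot4}| \cdot |S_{44}|$.

If the condition $(X_1, X_2, X_3) \perp\!\!\!\perp (X_5, X_6) \mid X_4$ is imposed, the maximum
likelihood estimator is (35) with $S_{(231)(56)\cdot4}$ replaced by 0 to obtain

(37)  $\begin{pmatrix} \begin{pmatrix} S_{(231)(231)\cdot4} & 0 \\ 0 & S_{(56)(56)\cdot4} \end{pmatrix} + S_{(2\cdots6)4}S_{44}^{-1}S_{4(2\cdots6)} & S_{(2\cdots6)4} \\ S_{4(2\cdots6)} & S_{44} \end{pmatrix}$.

The determinant of (37) is

(38)  $|S_{(231)(231)\cdot4}| \cdot |S_{(56)(56)\cdot4}| \cdot |S_{44}|$.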
The likelihood ratio criterion for $\Sigma_{(123)(56)\cdot4} = 0$ is

(39)  $\Bigl(\dfrac{|S_{(2\cdots6)(2\cdots6)\cdot4}|}{|S_{(231)(231)\cdot4}| \cdot |S_{(56)(56)\cdot4}|}\Bigr)^{n/2} = U_{(231)(56)\cdot4}^{n/2}$.

Here $U_{(231)(56)\cdot4}$ has the distribution of $U_{p_1+p_2+p_3,\,p_5+p_6,\,n-p_4}$ (Section 8.4) since
the distribution of $S_{(2\cdots6)(2\cdots6)\cdot4}$ is $W[\Sigma_{(2\cdots6)(2\cdots6)\cdot4}, n - p_4]$, independent
of $S_{44}$.

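The first three rows and columns of $S_{(2\cdots6)(2\cdots6)\cdot4}$ constitute the matrix

(40)  $S_{(231)(231)\cdot4} = \begin{pmatrix} S_{22\cdot4} & S_{23\cdot4} & S_{21\cdot4} \\ S_{32\cdot4} & S_{33\cdot4} & S_{31\cdot4} \\ S_{12\cdot4} & S_{13\cdot4} & S_{11\cdot4} \end{pmatrix} = \begin{pmatrix} \begin{pmatrix} S_{22\cdot14} & S_{23\cdot14} \\ S_{32\cdot14} & S_{33\cdot14} \end{pmatrix} + \begin{pmatrix} S_{21\cdot4} \\ S_{31\cdot4} \end{pmatrix} S_{11\cdot4}^{-1} (S_{12\cdot4}\ \ S_{13\cdot4}) & \begin{pmatrix} S_{21\cdot4} \\ S_{31\cdot4} \end{pmatrix} \\ (S_{12\cdot4}\ \ S_{13\cdot4}) & S_{11\cdot4} \end{pmatrix}$.

The determinant of (40) is

(41)  $|S_{(231)(231)\cdot4}| = |S_{(23)(23)\cdot14}| \cdot |S_{11\cdot4}|$.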


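The estimator of $\Sigma_{(231)(231)\cdot4}$ with $X_2 \perp\!\!\!\perp X_3 \mid X_1, X_4$ imposed is (40) with $S_{23\cdot14}$
replaced by 0 to obtain

(42)  $\begin{pmatrix} \begin{pmatrix} S_{22\cdot14} & 0 \\ 0 & S_{33\cdot14} \end{pmatrix} + \begin{pmatrix} S_{21\cdot4} \\ S_{31\cdot4} \end{pmatrix} S_{11\cdot4}^{-1} (S_{12\cdot4}\ \ S_{13\cdot4}) & \begin{pmatrix} S_{21\cdot4} \\ S_{31\cdot4} \end{pmatrix} \\ (S_{12\cdot4}\ \ S_{13\cdot4}) & S_{11\cdot4} \end{pmatrix}$,

the determinant of which is

(43)  $|S_{22\cdot14}| \cdot |S_{33\cdot14}| \cdot |S_{11\cdot4}|$.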
The likelihood ratio criterion for $\Sigma_{23\cdot14} = 0$ is

(44)  $\Bigl(\dfrac{|S_{(23)(23)\cdot14}|}{|S_{22\cdot14}| \cdot |S_{33\cdot14}|}\Bigr)^{n/2} = U_{23\cdot14}^{n/2}$.

The statistic $U_{23\cdot14}$ has the distribution of $U_{p_2,\,p_3,\,n-(p_1+p_4)}$ (Section 8.4)
since $S_{(23)(23)\cdot14}$ has the distribution $W[\Sigma_{(23)(23)\cdot14}, n - (p_1 + p_4)]$ independent
of $S_{(14)(14)}$.

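The estimator of $\Sigma_{(56)(56)\cdot4}$ with $\Sigma_{56\cdot4} = 0$ imposed is $S_{(56)(56)\cdot4}$ with $S_{56\cdot4}$
replaced by 0 to obtain

(45)  $\begin{pmatrix} S_{55\cdot4} & 0 \\ 0 & S_{66\cdot4} \end{pmatrix}$.

The likelihood ratio criterion for testing $\Sigma_{56\cdot4} = 0$ is

(46)  $\Bigl(\dfrac{|S_{(56)(56)\cdot4}|}{|S_{55\cdot4}| \cdot |S_{66\cdot4}|}\Bigr)^{n/2} = U_{56\cdot4}^{n/2}$.

The statistic $U_{56\cdot4}$ has the distribution of $U_{p_5,\,p_6,\,n-p_4}$ since $S_{(56)(56)\cdot4}$
has the distribution $W(\Sigma_{(56)(56)\cdot4}, n - p_4)$ independent of $S_{44}$.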

The estimator of $\Sigma$ under three null hypotheses is (37) with $S_{(231)(231)\cdot4}$
replaced by (42) and $S_{(56)(56)\cdot4}$ replaced by (45). The determinant of this
matrix is

(47)  $|\hat\Sigma_\omega| = |S_{22\cdot14}| \cdot |S_{33\cdot14}| \cdot |S_{11\cdot4}| \cdot |S_{55\cdot4}| \cdot |S_{66\cdot4}| \cdot |S_{44}|$.

The likelihood ratio criterion for testing the three null hypotheses is

(48)  $\Bigl(\dfrac{|\hat\Sigma_\Omega|}{|\hat\Sigma_\omega|}\Bigr)^{n/2} = \bigl(U_{(231)(56)\cdot4}\, U_{23\cdot14}\, U_{56\cdot4}\bigr)^{n/2}$.

When the null hypotheses are true, the factors $U_{(231)(56)\cdot4}$, $U_{23\cdot14}$, and $U_{56\cdot4}$ are
independent. Their distributions are discussed in Sections 8.4 and 8.5. In
particular the moments of these factors are given and asymptotic expansions
of distributions are described.


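15.5.4. Directed Graphs

We suppose that the vertices are well-numbered, $1, \ldots, n$; the $N$ observations
$x_{(1)}, \ldots, x_{(N)}$ are made on $X = (X_1', \ldots, X_n')'$. The model is (22) to (25) or
(26) of Section 15.3. Let $\bar{x} = N^{-1}\sum_{\alpha=1}^{N} x_{(\alpha)}$ and
$S = (N-1)^{-1}\sum_{\alpha=1}^{N} (x_{(\alpha)} - \bar{x})(x_{(\alpha)} - \bar{x})'$. The model (26) consists of the equation
for $X_1$, with mean vector $\alpha_1$, and $n - 1$ regressions (23) to (25). The vector
$\alpha_1$ is estimated by $\bar{x}_1$. If $pa(v_2)$ is vacuous, $\alpha_2$ is estimated by $\bar{x}_2$; if
$pa(v_2)$ is not vacuous and $pa(v_2) = v_1$, then $B_2$ and $\alpha_2$ are estimated by

(49)  $\hat{B}_2 = \sum_{\alpha=1}^{N} (x_{2(\alpha)} - \bar{x}_2)(x_{1(\alpha)} - \bar{x}_1)' \Bigl[\sum_{\alpha=1}^{N} (x_{1(\alpha)} - \bar{x}_1)(x_{1(\alpha)} - \bar{x}_1)'\Bigr]^{-1}$,

(50)  $\hat\alpha_2 = \bar{x}_2 - \hat{B}_2 \bar{x}_1$.

In general

(51)  $\hat{B}_j = \sum_{\alpha=1}^{N} [x_{j(\alpha)} - \bar{x}_j][x_{pa(v_j)(\alpha)} - \bar{x}_{pa(v_j)}]' \Bigl\{\sum_{\alpha=1}^{N} [x_{pa(v_j)(\alpha)} - \bar{x}_{pa(v_j)}][x_{pa(v_j)(\alpha)} - \bar{x}_{pa(v_j)}]'\Bigr\}^{-1}$,

(52)  $\hat\alpha_j = \bar{x}_j - \hat{B}_j \bar{x}_{pa(v_j)}$.

Conditional on $x_{pa(v_j)(\alpha)}$, the distribution of these estimators is normal.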
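The regression estimators for a vertex given its parents can be sketched as follows. This is a minimal illustration assuming NumPy; the function name `regress_on_parents` and the simulated data are hypothetical. The intercept is taken as the child mean minus the coefficient matrix times the parent mean, the usual least squares convention for a model of the form $X_j = \alpha_j + B_j X_{pa(v_j)}$ plus a disturbance; that sign convention is an assumption of this sketch.

```python
import numpy as np

def regress_on_parents(x_child, x_parents):
    """Least squares B-hat and alpha-hat from N x q_child and N x q_pa arrays."""
    dc = x_child - x_child.mean(axis=0)        # deviations of the child
    dp = x_parents - x_parents.mean(axis=0)    # deviations of the parents
    B = (dc.T @ dp) @ np.linalg.inv(dp.T @ dp)
    alpha = x_child.mean(axis=0) - B @ x_parents.mean(axis=0)
    return B, alpha

rng = np.random.default_rng(3)
x_pa = rng.standard_normal((30, 2))
B_true = np.array([[1.0, -2.0]])
alpha_true = np.array([0.5])
x_ch = alpha_true + x_pa @ B_true.T            # noise-free data for the check
B_hat, alpha_hat = regress_on_parents(x_ch, x_pa)
```

With noise-free data the estimates reproduce `B_true` and `alpha_true` up to rounding; with disturbances added they would be the usual least squares estimates.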

15.5.5. Chain Graphs

The condition (C1) of Section 15.4 specifies that $X_u \perp\!\!\!\perp X_v \mid X_{pa_D(\tau)}$ for $u \in V(\tau)$
and for $v \in V(\sigma)$, where $\sigma \in nd_D(\tau) \setminus pa_D(\tau)$; that is, the past earlier than
$pa_D(\tau)$ is independent of the present. This condition corresponds to the
Markov property in time series analysis. Thus the analysis of $X_\tau$ is in terms
of deviations from the regression of $X_\tau$ on $X_{pa_D(\tau)}$,

$\mathcal{E}(X_\tau \mid X_{pa_D(\tau)}) = \alpha_\tau + B_\tau X_{pa_D(\tau)}$.

The vector $\alpha_\tau$ and the matrix $B_\tau$ are estimated as for directed graphs.

The Markov property (C2) indicates the analysis in terms of deviations
$X_\tau - \alpha_\tau - B_\tau X_{pa_D(\tau)}$. The estimation of the structure of dependence within
$V(\tau)$ is carried out as in Section 15.5.2.

The Markov property (C3*) specifies $X_U \perp\!\!\!\perp \ldots$ for $U \subseteq V(\tau)$
and $pa_G(U) \cup nb_G(U)$. The property is a restriction on the regression of
$X_\tau$ on $X_{pa_D(\tau)}$.
APPENDIX A


Matrix Theory 


A.1. DEFINITION OF A MATRIX AND OPERATIONS 
ON MATRICES 

In this appendix we summarize some of the well-known definitions and 
theorems of matrix algebra. A number of results that are not always con¬ 
tained in books on matrix algebra are proved here. 

An $m \times n$ matrix $A$ is a rectangular array of real numbers

(1)  $A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}$,

which may be abbreviated $(a_{ij})$, $i = 1, 2, \ldots, m$, $j = 1, 2, \ldots, n$. Capital
boldface letters will be used to denote matrices whose elements are the
corresponding lowercase letters with appropriate subscripts. The sum of two
matrices $A$ and $B$ of the same numbers of rows and columns, respectively, is
defined by

(2)  $A + B = (a_{ij}) + (b_{ij}) = (a_{ij} + b_{ij})$.

The product of a matrix by a real number $\lambda$ is defined by

(3)  $\lambda A = A\lambda = (\lambda a_{ij})$.

An Introduction to Multivariate Statistical Analysis, Third Edition. By T. W. Anderson
ISBN 0-471-36091-0 Copyright © 2003 John Wiley & Sons, Inc.
These operations have the algebraic properties

(4)  $A + B = B + A$,

(5)  $(A + B) + C = A + (B + C)$,

(6)  $A + (-1)A = (0)$,

(7)  $(\lambda + \mu)A = \lambda A + \mu A$,

(8)  $\lambda(A + B) = \lambda A + \lambda B$,

(9)  $\lambda(\mu A) = (\lambda\mu)A$.

The matrix (0) with all elements 0 is denoted as 0. The operation $A + (-1)B$
is denoted as $A - B$.

If $A$ has the same number of columns as $B$ has rows, that is, $A = (a_{ij})$,
$i = 1, \ldots, l$, $j = 1, \ldots, m$, $B = (b_{jk})$, $j = 1, \ldots, m$, $k = 1, \ldots, n$, then $A$ and $B$
can be multiplied according to the rule

(10)  $AB = (a_{ij})(b_{jk}) = \Bigl(\sum_{j=1}^{m} a_{ij} b_{jk}\Bigr)$,  $i = 1, \ldots, l$,  $k = 1, \ldots, n$;

that is, $AB$ is a matrix with $l$ rows and $n$ columns, the element in the $i$th row
and $k$th column being $\sum_{j=1}^{m} a_{ij} b_{jk}$. The matrix product has the properties

(11)  $(AB)C = A(BC)$,

(12)  $A(B + C) = AB + AC$,

(13)  $(A + B)C = AC + BC$.

The relationships (11)-(13) hold provided one side is meaningful (i.e., the
numbers of rows and columns are such that the operations can be performed);
it follows then that the other side is also meaningful. Because of (11) we can 
write 

(14) (AB)C=A(BC)=ABC. 

The product BA may be meaningless even if AB is meaningful, and even 
when both are meaningful they are not necessarily equal. 

The transpose of the $l \times m$ matrix $A = (a_{ij})$ is defined to be the $m \times l$
matrix $A'$ which has in the $j$th row and $i$th column the element that $A$ has in
the $i$th row and $j$th column. The operation of transposition has the
properties

(15)  $(A')' = A$,

(16)  $(A + B)' = A' + B'$,

(17)  $(AB)' = B'A'$,

again with the restriction (which is understood throughout this book) that at
least one side is meaningful.

A vector $x$ with $m$ components can be treated as a matrix with $m$ rows
and one column. Therefore, the above operations hold for vectors.

We shall now be concerned with square matrices of the same size, which
can be added and multiplied at will. The number of rows and columns will be
taken to be $p$. $A$ is called symmetric if $A = A'$. A particular matrix of
considerable interest is the identity matrix

(18)  $I = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{pmatrix} = (\delta_{ij})$,

where $\delta_{ij}$, the Kronecker delta, is defined by

(19)  $\delta_{ij} = 1$,  $i = j$,
      $\delta_{ij} = 0$,  $i \ne j$.

The identity matrix satisfies

(20)  $IA = AI = A$.

We shall write the identity as $I_p$ when we wish to emphasize that it is of
order $p$. Associated with any square matrix $A$ is the determinant $|A|$, defined
by

(21)  $|A| = \sum (-1)^{f(j_1, \ldots, j_p)} \prod_{i=1}^{p} a_{i j_i}$,

where the summation is taken over all permutations $(j_1, \ldots, j_p)$ of the set of
integers $(1, \ldots, p)$, and $f(j_1, \ldots, j_p)$ is the number of transpositions required
to change $(1, \ldots, p)$ into $(j_1, \ldots, j_p)$. A transposition consists of interchanging
two numbers, and it can be shown that, although one can transform $(1, \ldots, p)$
into $(j_1, \ldots, j_p)$ by transpositions in many different ways, the number of
transpositions required is always even or always odd, so that $(-1)^{f(j_1, \ldots, j_p)}$ is
consistently defined. Then

(22)  $|AB| = |A| \cdot |B|$.

Also

(23)  $|A| = |A'|$.
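The permutation definition (21) can be evaluated directly for small matrices. The sketch below assumes NumPy and the standard library; the function names `sign_of` and `det_by_permutations` are hypothetical. The parity of a permutation is computed from its number of inversions, which has the same parity as any count of transpositions. The cost grows as $p!$, so this is for illustration only.

```python
from itertools import permutations

import numpy as np

def sign_of(perm):
    """(-1)^f, with the parity f obtained by counting inversions of perm."""
    inversions = sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )
    return -1 if inversions % 2 else 1

def det_by_permutations(A):
    """Determinant as the signed sum over permutations, as in (21)."""
    p = A.shape[0]
    return sum(
        sign_of(js) * np.prod([A[i, js[i]] for i in range(p)])
        for js in permutations(range(p))
    )

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))
det_A = det_by_permutations(A)
```

The properties (22) and (23) can then be checked on `A` and `B` by comparing `det_by_permutations(A @ B)` with the product of the two determinants, and `det_by_permutations(A.T)` with `det_A`.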


A submatrix of $A$ is a rectangular array obtained from $A$ by deleting rows
and columns. A minor is the determinant of a square submatrix of $A$. The
minor of an element $a_{ij}$ is the determinant of the submatrix of a square
matrix $A$ obtained by deleting the $i$th row and $j$th column. The cofactor of
$a_{ij}$, say $A_{ij}$, is $(-1)^{i+j}$ times the minor of $a_{ij}$. It follows from (21) that

(24)  $|A| = \sum_{j=1}^{p} a_{ij} A_{ij} = \sum_{j=1}^{p} a_{jk} A_{jk}$.

If $|A| \ne 0$, there exists a unique matrix $B$ such that $AB = I$. Then $B$ is
called the inverse of $A$ and is denoted by $A^{-1}$. Let $a^{hk}$ be the element of $A^{-1}$
in the $h$th row and $k$th column. Then

(25)  $a^{hk} = \dfrac{A_{kh}}{|A|}$.

The operation of taking the inverse satisfies

(26)  $(AC)^{-1} = C^{-1}A^{-1}$,

since

(27)  $(AC)(C^{-1}A^{-1}) = A(CC^{-1})A^{-1} = AIA^{-1} = AA^{-1} = I$.

Also $I^{-1} = I$ and $A^{-1}A = I$. Furthermore, since the transposition of (27) gives
$(A^{-1})'A' = I$, we have $(A^{-1})' = (A')^{-1}$.

A matrix whose determinant is not zero is called nonsingular. If $|A| \ne 0$,
then the only solution to

(28)  $Az = 0$

is the trivial one $z = 0$ [by multiplication of (28) on the left by $A^{-1}$]. If
$|A| = 0$, there is at least one nontrivial solution (that is, $z \ne 0$). Thus an
equivalent definition of $A$ being nonsingular is that (28) have only the trivial
solution.

A set of vectors $z_1, \ldots, z_r$ is said to be linearly independent if there exists
no set of scalars $c_1, \ldots, c_r$, not all zero, such that $\sum_{i=1}^{r} c_i z_i = 0$. A $q \times p$
matrix $D$ is said to be of rank $r$ if the maximum number of linearly
independent columns is $r$. Then every minor of order $r + 1$ must be zero
(from the remarks in the preceding paragraph applied to the relevant square
matrix of order $r + 1$), and at least one minor of order $r$ must be nonzero.
Conversely, if there is at least one minor of order $r$ that is nonzero, there is
at least one set of $r$ columns (or rows) which is linearly independent. If all
minors of order $r + 1$ are zero, there cannot be any set of $r + 1$ columns (or
rows) that are linearly independent, for such linear independence would
imply a nonzero minor of order $r + 1$, but this contradicts the assumption.
Thus rank $r$ is equivalently defined by the maximum number of linearly
independent rows, by the maximum number of linearly independent columns,
or by the maximum order of nonzero minors.

We now consider the quadratic form

(29)  $x'Ax = \sum_{i,j=1}^{p} a_{ij} x_i x_j$,

where $x' = (x_1, \ldots, x_p)$ and $A = (a_{ij})$ is a symmetric matrix. This matrix $A$
and the quadratic form are called positive semidefinite if $x'Ax \ge 0$ for all $x$. If
$x'Ax > 0$ for all $x \ne 0$, then $A$ and the quadratic form are called positive
definite. In this book positive definite implies the matrix is symmetric.

Theorem A.1.1. If $C$ with $p$ rows and columns is positive definite, and if $B$
with $p$ rows and $q$ columns, $q \le p$, is of rank $q$, then $B'CB$ is positive definite.

Proof. Given a vector $y \ne 0$, let $x = By$. Since $B$ is of rank $q$, $By = x \ne 0$.
Then

(30)  $y'(B'CB)y = (By)'C(By) = x'Cx > 0$.

The proof is completed by observing that $B'CB$ is symmetric. As a converse,
we observe that $B'CB$ is positive definite only if $B$ is of rank $q$, for otherwise
there exists $y \ne 0$ such that $By = 0$. ■

Corollary A.1.1. If $C$ is positive definite and $B$ is nonsingular, then $B'CB$ is
positive definite.

Corollary A.1.2. If $C$ is positive definite, then $C^{-1}$ is positive definite.

Proof. $C$ must be nonsingular; for if $Cx = 0$ for $x \ne 0$, then $x'Cx = 0$ for
this $x$, but that is contrary to the assumption that $C$ is positive definite. Let
$B$ in Theorem A.1.1 be $C^{-1}$. Then $B'CB = (C^{-1})'CC^{-1} = (C^{-1})'$. Transposing
$CC^{-1} = I$, we have $(C^{-1})'C' = (C^{-1})'C = I$. Thus $C^{-1} = (C^{-1})'$. ■

Corollary A.1.3. The $q \times q$ matrix formed by deleting $p - q$ rows of a
positive definite matrix $C$ and the corresponding $p - q$ columns of $C$ is positive
definite.

Proof. This follows from Theorem A.1.1 by forming $B$ by taking the $p \times p$
identity matrix and deleting the columns corresponding to those deleted
from $C$. ■

The trace of a square matrix $A$ is defined as $\operatorname{tr} A = \sum_{i=1}^{p} a_{ii}$. The following
properties are verified directly:

(31)  $\operatorname{tr}(A + B) = \operatorname{tr} A + \operatorname{tr} B$,

(32)  $\operatorname{tr} AB = \operatorname{tr} BA$.

A square matrix $A$ is said to be diagonal if $a_{ij} = 0$, $i \ne j$. Then $|A| = \prod_{i=1}^{p} a_{ii}$,
for in (24) $|A| = a_{11}A_{11}$, and in turn $A_{11}$ is evaluated similarly.

A square matrix $A$ is said to be triangular if $a_{ij} = 0$ for $i > j$ or
alternatively for $i < j$. If $a_{ij} = 0$ for $i > j$, the matrix is upper triangular, and, if
$a_{ij} = 0$ for $i < j$, it is lower triangular. The product of two upper triangular
matrices $A$, $B$ is upper triangular, for the $i,j$th term ($i > j$) of $AB$ is
$\sum_{k=1}^{p} a_{ik} b_{kj} = 0$ since $a_{ik} = 0$ for $k < i$ and $b_{kj} = 0$ for $k > j$. Similarly, the
product of two lower triangular matrices is lower triangular. The determinant
of a triangular matrix is the product of the diagonal elements. The inverse of
a nonsingular triangular matrix is triangular in the same way.

Theorem A.1.2. If $A$ is nonsingular, there exists a nonsingular lower
triangular matrix $F$ such that $FA = A^*$ is nonsingular upper triangular.

Proof. Let $A = A_1$. Define recursively $A_g = (a_{ij}^{(g)}) = F_{g-1}A_{g-1}$, $g = 2, \ldots, p$,
where $F_{g-1} = (f_{ij}^{(g-1)})$ has elements

(33)  $f_{jj}^{(g-1)} = 1$,  $j = 1, \ldots, p$,

(34)  $f_{i,g-1}^{(g-1)} = -\dfrac{a_{i,g-1}^{(g-1)}}{a_{g-1,g-1}^{(g-1)}}$,  $i = g, \ldots, p$,

(35)  $f_{ij}^{(g-1)} = 0$,  otherwise.

Then

(36)  $a_{ij}^{(g)} = 0$,  $i = j+1, \ldots, p$,  $j = 1, \ldots, g-1$,

(37)  $a_{ij}^{(g)} = a_{ij}^{(g-1)}$,  $i = 1, \ldots, g-1$,  $j = 1, \ldots, p$,

(38)  $a_{ij}^{(g)} = a_{ij}^{(g-1)} + f_{i,g-1}^{(g-1)} a_{g-1,j}^{(g-1)}$,  $i, j = g, \ldots, p$.

Note that $F = F_{p-1} \cdots F_1$ is lower triangular and the elements of $A_g$ in the
first $g - 1$ columns below the diagonal are 0; in particular $A^* = FA$ is upper
triangular. From $|A| \ne 0$ and $|F_{g-1}| = 1$, we have $|A_{g-1}| \ne 0$. Hence
$a_{11}^{(g-1)}, \ldots, a_{g-2,g-2}^{(g-1)}$ are different from 0 and the last $p - g$ columns of $A_{g-1}$ can
be numbered so $a_{g-1,g-1}^{(g-1)} \ne 0$; then $f_{i,g-1}^{(g-1)}$ is well defined. ■


The equation $FA = A^*$ can be solved to obtain $A = LR$, where $R = A^*$ is
upper triangular and $L = F^{-1}$ is lower triangular and has 1's on the main
diagonal (because $F$ is lower triangular and has 1's on the main diagonal).
This is known as the LR decomposition.
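The recursion in the proof of Theorem A.1.2 is ordinary elimination without row exchanges, and can be sketched as follows. The code assumes NumPy; the function name `lr_decomposition` is hypothetical, and the sketch assumes every pivot is nonzero as encountered (the proof renumbers columns to guarantee this, which the sketch does not do). A positive definite example is used so that the assumption holds.

```python
import numpy as np

def lr_decomposition(A):
    """Return (F, A_star) with F unit lower triangular and F A = A_star upper triangular."""
    p = A.shape[0]
    F = np.eye(p)
    A_g = A.astype(float).copy()
    for g in range(1, p):
        pivot = A_g[g - 1, g - 1]
        if pivot == 0.0:
            raise ValueError("zero pivot: columns would need renumbering")
        F_g = np.eye(p)
        # Multipliers as in (34): minus the column below the pivot, over the pivot.
        F_g[g:, g - 1] = -A_g[g:, g - 1] / pivot
        A_g = F_g @ A_g
        F = F_g @ F
    return F, A_g

rng = np.random.default_rng(5)
Z = rng.standard_normal((4, 7))
A = Z @ Z.T                      # positive definite, so all pivots are positive
F, A_star = lr_decomposition(A)
L, R = np.linalg.inv(F), A_star  # A = L R
```

For this symmetric positive definite `A`, the product `F @ A @ F.T` is diagonal with positive diagonal elements.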

Corollary A.1.4. If $A$ is positive definite, there exists a lower triangular
nonsingular matrix $F$ such that $FAF'$ is diagonal and positive definite.

Proof. From Theorem A.1.2, there exists a lower triangular nonsingular
matrix $F$ such that $FA$ is upper triangular and nonsingular. Then $FAF'$ is
upper triangular and symmetric; hence it is diagonal. ■

Corollary A.1.5. The determinant of a positive definite matrix $A$ is positive.

Proof. From the construction of $FAF'$,

(39)  $FAF' = \begin{pmatrix} a_{11}^{(1)} & 0 & 0 & \cdots & 0 \\ 0 & a_{22}^{(2)} & 0 & \cdots & 0 \\ 0 & 0 & a_{33}^{(3)} & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 0 & \cdots & a_{pp}^{(p)} \end{pmatrix}$

is positive definite, and hence $a_{gg}^{(g)} > 0$, $g = 1, \ldots, p$, and
$0 < |FAF'| = |F| \cdot |A| \cdot |F'| = |A|$. ■
Corollary A.1.6. If $A$ is positive definite, there exists a lower triangular
matrix $G$ such that $GAG' = I$.

Proof. Let $FAF' = D^2$, and let $D$ be the diagonal matrix whose diagonal
elements are the positive square roots of the diagonal elements of $D^2$. Then
$G = D^{-1}F$ serves the purpose. ■

Corollary A.1.7 (Cholesky Decomposition). If $A$ is positive definite, there
exists a unique lower triangular matrix $T$ ($t_{ij} = 0$, $i < j$) with positive diagonal
elements such that $A = TT'$.

Proof. From Corollary A.1.6, $A = G^{-1}(G')^{-1}$, where $G$ is lower triangular.
Then $T = G^{-1}$ is lower triangular. ■
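Corollaries A.1.6 and A.1.7 can be illustrated with a library routine. The sketch below assumes NumPy, whose `numpy.linalg.cholesky` returns the lower triangular factor; the example matrix is an arbitrary positive definite matrix generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
Z = rng.standard_normal((4, 7))
A = Z @ Z.T                      # positive definite with probability 1

T = np.linalg.cholesky(A)        # lower triangular, positive diagonal, A = T T'
G = np.linalg.inv(T)             # lower triangular G with G A G' = I
```

Here `T` plays the role of the matrix in Corollary A.1.7 and `G` that of Corollary A.1.6.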

In effect this theorem was proved in Section 12 for A = W\ 

A.2. CHARACTERISTIC ROOTS AND VECTORS 

The characteristic roots of a square matrix $B$ are defined as the roots of the
characteristic equation

(1)  $|B - \lambda I| = 0$.

Alternative terms are latent roots and eigenvalues. For example, with

$B = \begin{pmatrix} 5 & 2 \\ 2 & 5 \end{pmatrix}$,

we have

(2)  $|B - \lambda I| = \begin{vmatrix} 5 - \lambda & 2 \\ 2 & 5 - \lambda \end{vmatrix} = 25 - 4 - 10\lambda + \lambda^2 = \lambda^2 - 10\lambda + 21$.

The degree of the polynomial equation (1) is the order of the matrix $B$ and
the constant term is $|B|$.

A matrix $C$ is said to be orthogonal if $C'C = I$; it follows that $CC' = I$. Let
the vectors $x' = (x_1, \ldots, x_p)$ and $y' = (y_1, \ldots, y_p)$ represent two points in a
$p$-dimensional Euclidean space. The distance squared between them is
$D(x, y) = (x - y)'(x - y)$. The transformation $z = Cx$ can be thought of as a
change of coordinate axes in the $p$-dimensional space. If $C$ is orthogonal, the
transformation is distance-preserving, for

(3)  $D(Cx, Cy) = (Cy - Cx)'(Cy - Cx) = (y - x)'C'C(y - x) = (y - x)'(y - x) = D(x, y)$.

Since the angles of a triangle are determined by the lengths of its sides, the
transformation $z = Cx$ also preserves angles. It consists of a rotation together
with a possible reflection of one or more axes. We shall denote $\sqrt{x'x}$ by $\|x\|$.
Theorem A.2.1. Given any symmetric matrix $B$, there exists an orthogonal
matrix $C$ such that

(4)  $C'BC = D = \begin{pmatrix} d_1 & 0 & \cdots & 0 \\ 0 & d_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & d_p \end{pmatrix}$.

If $B$ is positive semidefinite, then $d_i \ge 0$, $i = 1, \ldots, p$; if $B$ is positive definite, then
$d_i > 0$, $i = 1, \ldots, p$.

The proof is given in the discussion of principal components in Section
11.2 for the case of $B$ positive semidefinite and holds for $B$ symmetric. The
characteristic equation (1) under transformation by $C$ becomes

(5)  $0 = |C'| \cdot |B - \lambda I| \cdot |C| = |C'(B - \lambda I)C| = |C'BC - \lambda I| = |D - \lambda I| = \prod_{i=1}^{p} (d_i - \lambda)$.

Thus the characteristic roots of $B$ are the diagonal elements of the
transformed matrix $D$.
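Theorem A.2.1 and the distance-preserving property of orthogonal matrices can be checked on a small example. The sketch assumes NumPy; the matrix entries are an assumption chosen to be consistent with the characteristic polynomial $\lambda^2 - 10\lambda + 21$ of the worked example, and the points `x`, `y` and the helper `dist` are hypothetical.

```python
import numpy as np

B = np.array([[5.0, 2.0],
              [2.0, 5.0]])       # characteristic polynomial: lambda^2 - 10 lambda + 21

roots, C = np.linalg.eigh(B)     # roots in ascending order; columns of C orthonormal
D = C.T @ B @ C                  # diagonal matrix of characteristic roots

def dist(u, v):
    """Squared Euclidean distance D(u, v) = (u - v)'(u - v)."""
    return (u - v) @ (u - v)

x = np.array([1.0, -3.0])
y = np.array([2.0, 0.5])
```

The squared distance between `C @ x` and `C @ y` should equal `dist(x, y)`, since `C` is orthogonal.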

If $\lambda_i$ is a characteristic root of $B$, then a vector $x_i$ not identically 0
satisfying

(6)  $(B - \lambda_i I)x_i = 0$

is called a characteristic vector (or eigenvector) of the matrix $B$ corresponding
to the characteristic root $\lambda_i$. Any scalar multiple of $x_i$ is also a characteristic
vector. When $B$ is symmetric, $x_i'(B - \lambda_i I) = 0$. If the roots are distinct,
$x_j'Bx_i = 0$ and $x_j'x_i = 0$, $i \ne j$. Let $c_i = (1/\|x_i\|)x_i$ be the $i$th normalized
characteristic vector, and let $C = (c_1, \ldots, c_p)$. Then $C'C = I$ and $BC = CD$.
These lead to (4). If a characteristic root has multiplicity $m$, then a set of $m$
corresponding characteristic vectors can be replaced by $m$ linearly
independent linear combinations of them. The vectors can be chosen to satisfy (6)
and $x_j'x_i = 0$ and $x_j'Bx_i = 0$, $i \ne j$.

A characteristic vector lies in the direction of the principal axis (see
Chapter 11). The characteristic roots of $B$ are proportional to the squares of
the reciprocals of the lengths of the principal axes of the ellipsoid

(7)  $x'Bx = 1$,

since this becomes under the rotation $y = C'x$

(8)  $1 = y'Dy = \sum_{i=1}^{p} d_i y_i^2$.

For a pair of matrices $A$ (nonsingular) and $B$ we shall also consider
equations of the form

(9)  $|B - \lambda A| = 0$.

The roots of such equations are of interest because of their invariance under
certain transformations. In fact, for nonsingular $C$, the roots of

(10)  $|C'BC - \lambda(C'AC)| = 0$

are the same as those of (9) since

(11)  $|C'BC - \lambda C'AC| = |C'(B - \lambda A)C| = |C'| \cdot |B - \lambda A| \cdot |C|$

and $|C'| = |C| \ne 0$.

By Corollary A.1.6 we have that if $A$ is positive definite there is a matrix
$E$ such that $E'AE = I$. Let $E'BE = B^*$. From Theorem A.2.1 we deduce that
there exists an orthogonal matrix $C$ such that $C'B^*C = D$, where $D$ is
diagonal. Defining $EC$ as $F$, we have the following theorem:

Theorem A.2.2. Given $B$ positive semidefinite and $A$ positive definite, there
exists a nonsingular matrix $F$ such that

(12)  $F'BF = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \lambda_p \end{pmatrix}$,

(13)  $F'AF = I$,

where $\lambda_1 \ge \cdots \ge \lambda_p\ (\ge 0)$ are the roots of (9). If $B$ is positive definite, $\lambda_i > 0$,
$i = 1, \ldots, p$.

Corresponding to each root $\lambda_i$ there is a vector $x_i$ satisfying

(14)  $(B - \lambda_i A)x_i = 0$

and $x_i'Ax_i = 1$. If the roots are distinct $x_j'Bx_i = 0$ and $x_j'Ax_i = 0$, $i \ne j$. Then
$F = (x_1, \ldots, x_p)$. If a root has multiplicity $m$, then a set of $m$ linearly
independent $x_i$'s can be replaced by $m$ linearly independent combinations of
them. The vectors can be chosen to satisfy (14) and $x_j'Bx_i = 0$ and $x_j'Ax_i = 0$,
$i \ne j$.
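The construction $F = EC$ that leads to Theorem A.2.2 can be followed step by step. The sketch below assumes NumPy; the function name `simultaneous_diagonalization` is hypothetical, and $E$ is obtained from a Cholesky factor of $A$, which is one of several possible choices satisfying $E'AE = I$.

```python
import numpy as np

def simultaneous_diagonalization(B, A):
    """Return (F, roots) with F'AF = I and F'BF = diag(roots), roots descending."""
    T = np.linalg.cholesky(A)              # A = T T'
    E = np.linalg.inv(T).T                 # E'AE = T^{-1} T T' (T')^{-1} = I
    roots, C = np.linalg.eigh(E.T @ B @ E) # orthogonal C diagonalizing B* = E'BE
    order = np.argsort(roots)[::-1]
    return E @ C[:, order], roots[order]

rng = np.random.default_rng(7)
Za = rng.standard_normal((4, 8))
Zb = rng.standard_normal((4, 2))
A = Za @ Za.T                              # positive definite
B = Zb @ Zb.T                              # positive semidefinite of rank 2
F, lam = simultaneous_diagonalization(B, A)

x = rng.standard_normal(4)
ratio = (x @ B @ x) / (x @ A @ x)          # lies between lam[-1] and lam[0]
```

The columns of `F` satisfy (14), that is, `B @ F` equals `A @ F @ np.diag(lam)` up to rounding, and the ratio of quadratic forms is bounded by the smallest and largest roots.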

Theorem A.2.3 (The Singular Value Decomposition). Given an $n \times p$
matrix $X$, $n \ge p$, there exists an $n \times n$ orthogonal matrix $P$, a $p \times p$ orthogonal
matrix $Q$, and an $n \times p$ matrix $D$ consisting of a $p \times p$ diagonal positive
semidefinite matrix and an $(n - p) \times p$ zero matrix such that

(15)  $X = PDQ$.

Proof. From Theorem A.2.1, there exists a $p \times p$ orthogonal matrix $Q$ and
a diagonal matrix $E$ such that

(16)  $QX'XQ' = E = \begin{pmatrix} E_1 & 0 \\ 0 & 0 \end{pmatrix}$,

where $E_1$ is diagonal and positive definite. Let $XQ' = Y = (Y_1\ Y_2)$, where the
number of columns of $Y_1$ is the order of $E_1$. Then $Y_2'Y_2 = 0$, and hence
$Y_2 = 0$. Let $P_1 = Y_1E_1^{-1/2}$. Then $P_1'P_1 = I$. An $n \times n$ orthogonal matrix $P =
(P_1\ P_2)$ satisfying the theorem is obtained by adjoining $P_2$ to make
$P$ orthogonal. Then the upper left-hand corner of $D$ is $E_1^{1/2}$, and the rest of $D$
consists of zeros. ■
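The proof of the singular value decomposition is constructive and can be mirrored in code. The sketch below assumes NumPy and, as a simplifying assumption, that $X$ has full column rank so that $E_1 = E$; the function name `svd_from_eigen` is hypothetical, and the completion of $P_1$ to an orthogonal matrix uses a complete QR factorization, which is one convenient choice.

```python
import numpy as np

def svd_from_eigen(X):
    """Build P, D, Q with X = P D Q from the characteristic roots and vectors of X'X."""
    n, p = X.shape
    w, V = np.linalg.eigh(X.T @ X)
    order = np.argsort(w)[::-1]
    e, Q = w[order], V[:, order].T         # Q X'X Q' = diag(e)
    Y = X @ Q.T
    P1 = Y / np.sqrt(e)                    # Y E^{-1/2}; columns are orthonormal
    P_full, _ = np.linalg.qr(P1, mode="complete")
    P = np.hstack([P1, P_full[:, p:]])     # adjoin an orthonormal complement P2
    D = np.vstack([np.diag(np.sqrt(e)), np.zeros((n - p, p))])
    return P, D, Q

rng = np.random.default_rng(8)
X = rng.standard_normal((6, 3))
P, D, Q = svd_from_eigen(X)
```

In practice `numpy.linalg.svd` would be used directly; the point of the sketch is only that the steps of the proof yield matrices with the stated properties.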

Theorem A.2.4. Let $A$ be positive definite and $B$ be positive semidefinite.
Then

(17)  $\lambda_p \le \dfrac{x'Bx}{x'x} \le \lambda_1$,

where $\lambda_1$ and $\lambda_p$ are the largest and smallest roots of (1), and

(18)  $\lambda_p \le \dfrac{x'Bx}{x'Ax} \le \lambda_1$,

where $\lambda_1$ and $\lambda_p$ are the largest and smallest roots of (9).

Proof. The inequalities (17) were essentially proved in Section 11.2, and
can also be derived from (4). The inequalities (18) follow from Theorem
A.2.2. ■
A square matrix $A$ is idempotent if $A^2 = A$. If $\lambda$ satisfies $|A - \lambda I| = 0$,
there exists a vector $x \ne 0$ such that $\lambda x = Ax = A^2x$. However, $A^2x = A(Ax)
= A\lambda x = \lambda^2 x$. Thus $\lambda^2 = \lambda$, and $\lambda$ is either 0 or 1. The multiplicity of $\lambda = 1$ is
the rank of $A$. If $A$ is $p \times p$, then $I_p - A$ is idempotent of rank $p - (\text{rank } A)$,
and $A$ and $I_p - A$ are orthogonal. If $A$ is symmetric, there is an orthogonal
matrix $O$ such that

(19)  $OAO' = \begin{pmatrix} I & 0 \\ 0 & 0 \end{pmatrix}$,  $O(I - A)O' = \begin{pmatrix} 0 & 0 \\ 0 & I \end{pmatrix}$.

A.3. PARTITIONED VECTORS AND MATRICES 


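Consider the matrix $A$ defined by (1) of Section A.1. Let

(1)  $A_{11} = (a_{ij})$,  $i = 1, \ldots, p$,  $j = 1, \ldots, q$,
     $A_{12} = (a_{ij})$,  $i = 1, \ldots, p$,  $j = q+1, \ldots, n$,
     $A_{21} = (a_{ij})$,  $i = p+1, \ldots, m$,  $j = 1, \ldots, q$,
     $A_{22} = (a_{ij})$,  $i = p+1, \ldots, m$,  $j = q+1, \ldots, n$.

Then we can write

(2)  $A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$.

We say that $A$ has been partitioned into submatrices $A_{ij}$. Let $B$ ($m \times n$) be
partitioned similarly into submatrices $B_{ij}$, $i, j = 1, 2$. Then

(3)  $A + B = \begin{pmatrix} A_{11} + B_{11} & A_{12} + B_{12} \\ A_{21} + B_{21} & A_{22} + B_{22} \end{pmatrix}$.

Now partition $C$ ($n \times r$) as

(4)  $C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}$,

where $C_{11}$ and $C_{12}$ have $q$ rows and $C_{11}$ and $C_{21}$ have $s$ columns. Then

(5)  $AC = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} = \begin{pmatrix} A_{11}C_{11} + A_{12}C_{21} & A_{11}C_{12} + A_{12}C_{22} \\ A_{21}C_{11} + A_{22}C_{21} & A_{21}C_{12} + A_{22}C_{22} \end{pmatrix}$.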
To verify this, consider an element in the first $p$ rows and first $s$ columns of
$AC$. The $i,j$th element is

(6)  $\sum_{k=1}^{n} a_{ik} c_{kj}$,  $i \le p$,  $j \le s$.

This sum can be written

(7)  $\sum_{k=1}^{q} a_{ik} c_{kj} + \sum_{k=q+1}^{n} a_{ik} c_{kj}$.

The first sum is the $i,j$th element of $A_{11}C_{11}$, the second sum is the $i,j$th
element of $A_{12}C_{21}$, and therefore the entire sum (6) is the $i,j$th element of
$A_{11}C_{11} + A_{12}C_{21}$. In a similar fashion we can verify that the other submatrices
of $AC$ can be written as in (5).

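We note in passing that if $A$ is partitioned as in (2), then the transpose of
$A$ can be written

(8)  $A' = \begin{pmatrix} A_{11}' & A_{21}' \\ A_{12}' & A_{22}' \end{pmatrix}$.

If $A_{12} = 0$ and $A_{21} = 0$, then for $A$ positive definite and $A_{11}$ square,

(9)  $A^{-1} = \begin{pmatrix} A_{11} & 0 \\ 0 & A_{22} \end{pmatrix}^{-1} = \begin{pmatrix} A_{11}^{-1} & 0 \\ 0 & A_{22}^{-1} \end{pmatrix}$.

The matrix on the right exists because $A_{11}$ and $A_{22}$ are nonsingular. That the
right-hand matrix is the inverse of $A$ is verified by multiplication:

(10)  $\begin{pmatrix} A_{11} & 0 \\ 0 & A_{22} \end{pmatrix} \begin{pmatrix} A_{11}^{-1} & 0 \\ 0 & A_{22}^{-1} \end{pmatrix} = \begin{pmatrix} I & 0 \\ 0 & I \end{pmatrix}$,

which is a partitioned form of $I$.

We also note that

(11)  $\begin{vmatrix} A_{11} & 0 \\ 0 & A_{22} \end{vmatrix} = \begin{vmatrix} A_{11} & 0 \\ 0 & I \end{vmatrix} \cdot \begin{vmatrix} I & 0 \\ 0 & A_{22} \end{vmatrix} = |A_{11}| \cdot |A_{22}|$.

The evaluation of the first determinant in the middle is made by expanding
according to minors of the last row; the only nonzero element in the sum is
the last, which is 1 times a determinant of the same form with $I$ of order one
less. The procedure is repeated until $|A_{11}|$ is the minor. Similarly,

(12)  $\begin{vmatrix} A_{11} & A_{12} \\ 0 & A_{22} \end{vmatrix} = \begin{vmatrix} I & 0 \\ 0 & A_{22} \end{vmatrix} \cdot \begin{vmatrix} A_{11} & A_{12} \\ 0 & I \end{vmatrix} = |A_{11}| \cdot |A_{22}|$.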

A useful fact is that if $A_1$ of $q$ rows and $p$ columns is of rank $q$, there
exists a matrix $A_2$ of $p - q$ rows and $p$ columns such that

(13)  $A = \begin{pmatrix} A_1 \\ A_2 \end{pmatrix}$

is nonsingular. This statement is verified by numbering the columns of $A$ so
that $A_{11}$ consisting of the first $q$ columns of $A_1$ is nonsingular (at least one
$q \times q$ minor of $A_1$ is different from zero) and then taking $A_2$ as $(0\ \ I)$; then

(14)  $|A| = \begin{vmatrix} A_{11} & A_{12} \\ 0 & I \end{vmatrix} = |A_{11}|$,

which is not equal to zero.


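Theorem A.3.1. Let the square matrix $A$ be partitioned as in (2) so that $A_{22}$ is square. If $A_{22}$ is nonsingular, let

(15)
$$B = \begin{pmatrix} I & -A_{12}A_{22}^{-1} \\ 0 & I \end{pmatrix}, \qquad C = \begin{pmatrix} I & 0 \\ -A_{22}^{-1}A_{21} & I \end{pmatrix}.$$

Then

(16)
$$BA = \begin{pmatrix} A_{11} - A_{12}A_{22}^{-1}A_{21} & 0 \\ A_{21} & A_{22} \end{pmatrix}, \qquad AC = \begin{pmatrix} A_{11} - A_{12}A_{22}^{-1}A_{21} & A_{12} \\ 0 & A_{22} \end{pmatrix},$$

(17)
$$BAC = \begin{pmatrix} A_{11} - A_{12}A_{22}^{-1}A_{21} & 0 \\ 0 & A_{22} \end{pmatrix}.$$

If $A$ is symmetric, $C = B'$.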


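Theorem A.3.2. Let the square matrix $A$ be partitioned as in (2) so that $A_{22}$ is square. If $A_{22}$ is nonsingular,

(18)
$$|A| = |A_{11} - A_{12}A_{22}^{-1}A_{21}| \cdot |A_{22}|.$$

Proof. Equation (18) follows from (16) because $|B| = 1$. ■

Corollary A.3.1. For $C$ nonsingular

(19)
$$\begin{vmatrix} C & y \\ y' & 1 \end{vmatrix} = |C - yy'| = \begin{vmatrix} 1 & y' \\ y & C \end{vmatrix} = |C| \, (1 - y'C^{-1}y).$$

Theorem A.3.3. Let the nonsingular matrix $A$ be partitioned as in (2) so that $A_{22}$ is square. If $A_{22}$ is nonsingular, let $A_{11\cdot 2} = A_{11} - A_{12}A_{22}^{-1}A_{21}$. Then

(20)
$$A^{-1} = \begin{pmatrix} A_{11\cdot 2}^{-1} & -A_{11\cdot 2}^{-1}A_{12}A_{22}^{-1} \\ -A_{22}^{-1}A_{21}A_{11\cdot 2}^{-1} & A_{22}^{-1}A_{21}A_{11\cdot 2}^{-1}A_{12}A_{22}^{-1} + A_{22}^{-1} \end{pmatrix}.$$

Proof. From Theorem A.3.1,

(21)
$$A = B^{-1} \begin{pmatrix} A_{11\cdot 2} & 0 \\ 0 & A_{22} \end{pmatrix} C^{-1}.$$

Hence

(22)
$$A^{-1} = C \begin{pmatrix} A_{11\cdot 2}^{-1} & 0 \\ 0 & A_{22}^{-1} \end{pmatrix} B = \begin{pmatrix} I & 0 \\ -A_{22}^{-1}A_{21} & I \end{pmatrix} \begin{pmatrix} A_{11\cdot 2}^{-1} & 0 \\ 0 & A_{22}^{-1} \end{pmatrix} \begin{pmatrix} I & -A_{12}A_{22}^{-1} \\ 0 & I \end{pmatrix}.$$

Multiplication gives the desired result. ■

Corollary A.3.2. If $x' = (x^{(1)\prime}, x^{(2)\prime})$, then

(23)
$$x'A^{-1}x = (x^{(1)} - A_{12}A_{22}^{-1}x^{(2)})' A_{11\cdot 2}^{-1} (x^{(1)} - A_{12}A_{22}^{-1}x^{(2)}) + x^{(2)\prime}A_{22}^{-1}x^{(2)}.$$

Proof. From the theorem

(24)
$$x'A^{-1}x = x^{(1)\prime}A_{11\cdot 2}^{-1}x^{(1)} - x^{(1)\prime}A_{11\cdot 2}^{-1}A_{12}A_{22}^{-1}x^{(2)} - x^{(2)\prime}A_{22}^{-1}A_{21}A_{11\cdot 2}^{-1}x^{(1)} + x^{(2)\prime}(A_{22}^{-1}A_{21}A_{11\cdot 2}^{-1}A_{12}A_{22}^{-1} + A_{22}^{-1})x^{(2)},$$

which is equal to the right-hand side of (23). ■

Theorem A.3.4. Let the nonsingular matrix $A$ be partitioned as in (2) so that $A_{22}$ is square. If $A_{11}$ and $A_{22}$ are nonsingular,

(25)
$$(A_{22} - A_{21}A_{11}^{-1}A_{12})^{-1} = A_{22}^{-1}A_{21}(A_{11} - A_{12}A_{22}^{-1}A_{21})^{-1}A_{12}A_{22}^{-1} + A_{22}^{-1}.$$

Proof. The lower right-hand corner of $A^{-1}$ is the right-hand side of (25) by Theorem A.3.3 and is also the left-hand side of (25) by interchange of 1 and 2. ■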
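The determinant identity of Theorem A.3.2, the partitioned inverse of Theorem A.3.3, and the identity of Theorem A.3.4 can be checked on a small example. The sketch below assumes an illustrative 3 × 3 matrix with $A_{11}$ of order 2 and $A_{22}$ of order 1, and uses exact rational arithmetic; the helper names are hypothetical.

```python
from fractions import Fraction as Fr

def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def det3(M):
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

def inv2(M):
    d = det2(M)
    return [[M[1][1] / d, -M[0][1] / d], [-M[1][0] / d, M[0][0] / d]]

# Illustrative matrix: A11 is the leading 2 x 2 block, A22 = (a22) is 1 x 1.
A = [[Fr(4), Fr(1), Fr(2)],
     [Fr(1), Fr(3), Fr(0)],
     [Fr(2), Fr(1), Fr(5)]]
A11 = [row[:2] for row in A[:2]]
a22 = A[2][2]
A12 = [A[0][2], A[1][2]]   # column vector
A21 = [A[2][0], A[2][1]]   # row vector

# A11.2 = A11 - A12 A22^{-1} A21
A11_2 = [[A11[i][j] - A12[i] * A21[j] / a22 for j in range(2)] for i in range(2)]

# Theorem A.3.2: |A| = |A11.2| |A22|
det_ok = det3(A) == det2(A11_2) * a22

# Theorem A.3.3: assemble the inverse block by block.
TL = inv2(A11_2)
TR = [-(TL[i][0] * A12[0] + TL[i][1] * A12[1]) / a22 for i in range(2)]
BL = [-(A21[0] * TL[0][j] + A21[1] * TL[1][j]) / a22 for j in range(2)]
BR = -(A21[0] * TR[0] + A21[1] * TR[1]) / a22 + 1 / a22
A_inv = [TL[0] + [TR[0]], TL[1] + [TR[1]], BL + [BR]]
product = [[sum(A[i][k] * A_inv[k][j] for k in range(3)) for j in range(3)]
           for i in range(3)]
inverse_ok = product == [[Fr(int(i == j)) for j in range(3)] for i in range(3)]

# Theorem A.3.4: the lower right corner equals (A22 - A21 A11^{-1} A12)^{-1}.
A11_inv = inv2(A11)
w = [A11_inv[i][0] * A12[0] + A11_inv[i][1] * A12[1] for i in range(2)]
corner_ok = BR == 1 / (a22 - (A21[0] * w[0] + A21[1] * w[1]))
print(det_ok, inverse_ok, corner_ok)
```

The matrix is deliberately not symmetric, since none of the three results requires symmetry.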


Theorem A.3.5. Let $U$ be $p \times m$. The conditions for $I_p - UU'$, $I_m - U'U$, and

(26)
$$\begin{pmatrix} I_p & U \\ U' & I_m \end{pmatrix}$$

to be positive definite are the same.

Proof. We have

(27)
$$(v' \;\; w') \begin{pmatrix} I_p & U \\ U' & I_m \end{pmatrix} \begin{pmatrix} v \\ w \end{pmatrix} = v'v + v'Uw + w'U'v + w'w = v'(I_p - UU')v + (U'v + w)'(U'v + w).$$

The second term on the right-hand side is nonnegative; the first term is positive for all $v \ne 0$ if and only if $I_p - UU'$ is positive definite. Reversing the roles of $v$ and $w$ shows that (26) is positive definite if and only if $I_m - U'U$ is positive definite. ■


A.4. SOME MISCELLANEOUS RESULTS 

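Theorem A.4.1. Let $C$ be $p \times p$, positive semidefinite, and of rank $r$ ($\le p$). Then there is a nonsingular matrix $A$ such that

(1)
$$ACA' = \begin{pmatrix} I_r & 0 \\ 0 & 0 \end{pmatrix}.$$

Proof. Since $C$ is of rank $r$, there is a $(p - r) \times p$ matrix $A_2$ such that

(2)
$$A_2 C = 0.$$

Choose $B$ ($r \times p$) such that

(3)
$$\begin{pmatrix} B \\ A_2 \end{pmatrix}$$

is nonsingular. Then

(4)
$$\begin{pmatrix} B \\ A_2 \end{pmatrix} C \, (B' \;\; A_2') = \begin{pmatrix} BCB' & BCA_2' \\ A_2CB' & A_2CA_2' \end{pmatrix} = \begin{pmatrix} BCB' & 0 \\ 0 & 0 \end{pmatrix}.$$

This matrix is of rank $r$, and therefore $BCB'$ is nonsingular. By Corollary A.1.6 there is a nonsingular matrix $D$ such that $D(BCB')D' = I_r$. Then

(5)
$$A = \begin{pmatrix} D & 0 \\ 0 & I_{p-r} \end{pmatrix} \begin{pmatrix} B \\ A_2 \end{pmatrix} = \begin{pmatrix} DB \\ A_2 \end{pmatrix}$$

is a nonsingular matrix such that (1) holds. ■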


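Lemma A.4.1. If $E$ is $p \times p$, symmetric, and nonsingular, there is a nonsingular matrix $F$ such that

(6)
$$FEF' = \begin{pmatrix} I & 0 \\ 0 & -I \end{pmatrix},$$

where the order of $I$ is the number of positive characteristic roots of $E$ and the order of $-I$ is the number of negative characteristic roots of $E$.

Proof. From Theorem A.2.1 we know there is an orthogonal matrix $G$ such that

(7)
$$GEG' = \begin{pmatrix} h_1 & 0 & \cdots & 0 \\ 0 & h_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & h_p \end{pmatrix},$$

where $h_1 \ge \cdots \ge h_q > 0 > h_{q+1} \ge \cdots \ge h_p$ are the characteristic roots of $E$. Let

(8)
$$K = \operatorname{diag}\left( \frac{1}{\sqrt{h_1}}, \ldots, \frac{1}{\sqrt{h_q}}, \frac{1}{\sqrt{-h_{q+1}}}, \ldots, \frac{1}{\sqrt{-h_p}} \right).$$

Then

(9)
$$KGEG'K' = (KG)E(KG)' = \begin{pmatrix} I & 0 \\ 0 & -I \end{pmatrix}. \qquad ■$$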

Corollary A.4.1. Let $C$ be $p \times p$, symmetric, and of rank $r$ ($\le p$). Then there is a nonsingular matrix $A$ such that

(10)
$$ACA' = \begin{pmatrix} I & 0 & 0 \\ 0 & -I & 0 \\ 0 & 0 & 0 \end{pmatrix},$$

where the order of $I$ is the number of positive characteristic roots of $C$ and the order of $-I$ is the number of negative characteristic roots, the sum of the orders being $r$.

Proof. The proof is the same as that of Theorem A.4.1 except that Lemma A.4.1 is used instead of Corollary A.1.6. ■

Lemma A.4.2. Let $A$ be $n \times m$ ($n > m$) such that

(11)
$$A'A = I_m.$$

There exists an $n \times (n - m)$ matrix $B$ such that $(A \;\; B)$ is orthogonal.

Proof. Since $A$ is of rank $m$, there exists an $n \times (n - m)$ matrix $C$ such that $(A \;\; C)$ is nonsingular. Take $D$ as $C - AA'C$; then $D'A = 0$. Let $E$ [$(n - m) \times (n - m)$] be such that $E'D'DE = I$. Then $B$ can be taken as $DE$. ■

Lemma A.4.3. Let $x$ be a vector of $n$ components. Then there exists an orthogonal matrix $O$ such that

(12)
$$Ox = (c, 0, \ldots, 0)',$$

where $c = \sqrt{x'x}$.

Proof. Let the first row of $O$ be $(1/c)x'$. The other rows may be chosen in any way to make the matrix orthogonal. ■



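Lemma A.4.4. Let $B = (b_{ij})$ be a $p \times p$ matrix. Then

(13)
$$\frac{\partial |B|}{\partial b_{ij}} = B_{ij}, \qquad i, j = 1, \ldots, p,$$

where $B_{ij}$ denotes the cofactor of $b_{ij}$.

Proof. The expansion of $|B|$ by elements of the $i$th row is

(14)
$$|B| = \sum_{h=1}^{p} b_{ih} B_{ih}.$$

Since $B_{ih}$ does not contain $b_{ij}$, the lemma follows. ■

Lemma A.4.5. Let $b_{ij} = \beta_{ij}(c_1, \ldots, c_n)$ be the $i,j$th element of a $p \times p$ matrix $B$. Then for $g = 1, \ldots, n$,

(15)
$$\frac{\partial |B|}{\partial c_g} = \sum_{i,h=1}^{p} \frac{\partial |B|}{\partial b_{ih}} \cdot \frac{\partial \beta_{ih}(c_1, \ldots, c_n)}{\partial c_g} = \sum_{i,h=1}^{p} B_{ih} \frac{\partial \beta_{ih}(c_1, \ldots, c_n)}{\partial c_g}.$$

Theorem A.4.2. If $A = A'$,

(16)
$$\frac{\partial |A|}{\partial a_{ii}} = A_{ii},$$

(17)
$$\frac{\partial |A|}{\partial a_{ij}} = 2A_{ij}, \qquad i \ne j.$$

Proof. Equation (16) follows from the expansion of $|A|$ according to elements of the $i$th row. To prove (17) let $b_{ij} = b_{ji} = a_{ij}$, $i, j = 1, \ldots, p$, $i < j$. Then by Lemma A.4.5,

(18)
$$\frac{\partial |B|}{\partial a_{ij}} = B_{ij} + B_{ji}.$$

Since $|A| = |B|$ and $B_{ij} = B_{ji} = A_{ij} = A_{ji}$, (17) follows. ■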


Theorem A.4.3.

(19)
$$\frac{\partial}{\partial x}(x'Ax) = 2Ax,$$

where $\partial/\partial x$ denotes taking partial derivatives with respect to each component of $x$ and arranging the partial derivatives in a column.

Proof. Let $h$ be a column vector of as many components as $x$. Then, using the symmetry of $A$,

(20)
$$(x + h)'A(x + h) = x'Ax + h'Ax + x'Ah + h'Ah = x'Ax + 2h'Ax + h'Ah.$$

The partial derivative vector is the vector multiplying $h'$ in the second term on the right. ■

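Definition A.4.1. Let $A = (a_{ij})$ be a $p \times m$ matrix and $B = (b_{\alpha\beta})$ be a $q \times n$ matrix. The $pq \times mn$ matrix with $a_{ij}b_{\alpha\beta}$ as the element in the $i,\alpha$th row and the $j,\beta$th column is called the Kronecker or direct product of $A$ and $B$ and is denoted by $A \otimes B$; that is,

(21)
$$A \otimes B = \begin{pmatrix} a_{11}B & a_{12}B & \cdots & a_{1m}B \\ a_{21}B & a_{22}B & \cdots & a_{2m}B \\ \vdots & \vdots & & \vdots \\ a_{p1}B & a_{p2}B & \cdots & a_{pm}B \end{pmatrix}.$$

Some properties are the following when the orders of matrices permit the indicated operations:

(22)
$$(A \otimes B)(C \otimes D) = (AC) \otimes (BD),$$

(23)
$$(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}.$$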

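Theorem A.4.4. Let the $i$th characteristic root of $A$ ($p \times p$) be $\lambda_i$ and the corresponding characteristic vector be $x_i = (x_{1i}, \ldots, x_{pi})'$, and let the $\alpha$th root of $B$ ($q \times q$) be $\nu_\alpha$ and the corresponding characteristic vector be $y_\alpha$, $\alpha = 1, \ldots, q$. Then the $i,\alpha$th root of $A \otimes B$ is $\lambda_i \nu_\alpha$, and the corresponding characteristic vector is $x_i \otimes y_\alpha = (x_{1i}y_\alpha', \ldots, x_{pi}y_\alpha')'$, $i = 1, \ldots, p$, $\alpha = 1, \ldots, q$.

Proof.

(24)
$$(A \otimes B)(x_i \otimes y_\alpha) = \begin{pmatrix} a_{11}B & \cdots & a_{1p}B \\ \vdots & & \vdots \\ a_{p1}B & \cdots & a_{pp}B \end{pmatrix} \begin{pmatrix} x_{1i}y_\alpha \\ \vdots \\ x_{pi}y_\alpha \end{pmatrix} = \begin{pmatrix} \sum_j a_{1j}x_{ji}By_\alpha \\ \vdots \\ \sum_j a_{pj}x_{ji}By_\alpha \end{pmatrix} = \begin{pmatrix} \lambda_i x_{1i}By_\alpha \\ \vdots \\ \lambda_i x_{pi}By_\alpha \end{pmatrix} = \lambda_i \nu_\alpha \begin{pmatrix} x_{1i}y_\alpha \\ \vdots \\ x_{pi}y_\alpha \end{pmatrix}. \qquad ■$$

Theorem A.4.5.

(25)
$$|A \otimes B| = |A|^q \, |B|^p.$$

Proof. The determinant of any matrix is the product of its roots; therefore

(26)
$$|A \otimes B| = \prod_{i=1}^{p} \prod_{\alpha=1}^{q} \lambda_i \nu_\alpha = \prod_{i=1}^{p} \left( \lambda_i^q \prod_{\alpha=1}^{q} \nu_\alpha \right) = |A|^q \, |B|^p. \qquad ■$$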
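The Kronecker product results can be illustrated with small integer matrices. The sketch below assumes two 2 × 2 matrices whose characteristic roots and vectors are known by inspection; `kron`, `matmul`, and `det` are hypothetical helper names, and `det` uses cofactor expansion, which is adequate only for small orders.

```python
def kron(A, B):
    """Kronecker product: rows indexed by (i, alpha), columns by (j, beta)."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def det(M):
    """Determinant by cofactor expansion along the first row."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(len(M)))

A = [[2, 1], [1, 2]]      # roots 3 and 1; (1, 1)' belongs to the root 3
B = [[1, 2], [2, 1]]      # roots 3 and -1; (1, -1)' belongs to the root -1
lam, x = 3, [1, 1]
nu, y = -1, [1, -1]

K = kron(A, B)
xy = [xi * yj for xi in x for yj in y]                     # x (x) y
K_xy = [sum(K[r][c] * xy[c] for c in range(len(xy))) for r in range(len(K))]
eigen_ok = K_xy == [lam * nu * v for v in xy]              # Theorem A.4.4

p, q = len(A), len(B)
det_ok = det(K) == det(A) ** q * det(B) ** p               # Theorem A.4.5

C = [[1, 0], [2, 1]]
D = [[0, 1], [1, 1]]
mixed_ok = matmul(K, kron(C, D)) == kron(matmul(A, C), matmul(B, D))  # (22)
print(eigen_ok, det_ok, mixed_ok)
```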

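Definition A.4.2. If the $p \times m$ matrix $A = (a_1, \ldots, a_m)$, then $\operatorname{vec} A = (a_1', \ldots, a_m')'$.

Some properties of the vec operator [e.g., Magnus (1988)] are

(27)
$$\operatorname{vec} ABC = (C' \otimes A) \operatorname{vec} B,$$

(28)
$$\operatorname{vec} xy' = y \otimes x.$$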
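Both vec identities can be verified directly on small integer matrices. This is a minimal sketch under the assumption that matrices are stored as lists of rows; the helper names are illustrative only.

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(M):
    return [list(col) for col in zip(*M)]

def kron(A, B):
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

def vec(M):
    """Stack the columns of M into one long vector."""
    return [M[i][j] for j in range(len(M[0])) for i in range(len(M))]

A = [[1, 2], [3, 4]]              # 2 x 2
B = [[1, 0, 2], [2, 1, 1]]        # 2 x 3
C = [[1, 1], [0, 2], [3, 1]]      # 3 x 2

# (27): vec ABC = (C' (x) A) vec B
lhs = vec(matmul(matmul(A, B), C))
K = kron(transpose(C), A)
vB = vec(B)
rhs = [sum(K[r][c] * vB[c] for c in range(len(vB))) for r in range(len(K))]
vec_ok = lhs == rhs

# (28): vec xy' = y (x) x
x, y = [1, 2, 3], [4, 5]
outer = [[xi * yj for yj in y] for xi in x]     # the 3 x 2 matrix xy'
outer_ok = vec(outer) == [yj * xi for yj in y for xi in x]
print(vec_ok, outer_ok)
```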

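Theorem A.4.6. The Jacobian of the transformation $E = Y^{-1}$ (from $E$ to $Y$) is $|Y|^{-2p}$, where $p$ is the order of $E$ and $Y$.

Proof. From $EY = I$, we have

(29)
$$\left( \frac{\partial}{\partial \theta} E \right) Y + E \left( \frac{\partial}{\partial \theta} Y \right) = 0,$$

where

(30)
$$\frac{\partial}{\partial \theta} E = \begin{pmatrix} \dfrac{\partial e_{11}}{\partial \theta} & \cdots & \dfrac{\partial e_{1p}}{\partial \theta} \\ \vdots & & \vdots \\ \dfrac{\partial e_{p1}}{\partial \theta} & \cdots & \dfrac{\partial e_{pp}}{\partial \theta} \end{pmatrix}.$$

Then

(31)
$$\frac{\partial}{\partial \theta} E = -E \left( \frac{\partial}{\partial \theta} Y \right) Y^{-1} = -E \left( \frac{\partial}{\partial \theta} Y \right) E.$$

If $\theta = y_{\alpha\beta}$, then

(32)
$$\frac{\partial}{\partial y_{\alpha\beta}} E = -E \varepsilon_{\alpha\beta} E = -e_{\cdot\alpha} e_{\beta\cdot},$$

where $\varepsilon_{\alpha\beta}$ is a $p \times p$ matrix with all elements 0 except the element in the $\alpha$th row and $\beta$th column, which is 1; and $e_{\cdot\alpha}$ is the $\alpha$th column of $E$ and $e_{\beta\cdot}$ is its $\beta$th row. Thus $\partial e_{ij} / \partial y_{\alpha\beta} = -e_{i\alpha} e_{\beta j}$. Then the Jacobian is the determinant of a $p^2 \times p^2$ matrix

(33)
$$\operatorname{mod} \left| \frac{\partial e_{ij}}{\partial y_{\alpha\beta}} \right| = \operatorname{mod} |e_{i\alpha} e_{\beta j}| = |E \otimes E'| = |E|^p |E'|^p = |E|^{2p} = |Y|^{-2p}. \qquad ■$$

Theorem A.4.7. Let $A$ and $B$ be symmetric matrices with characteristic roots $a_1 \ge a_2 \ge \cdots \ge a_p$ and $b_1 \ge b_2 \ge \cdots \ge b_p$, respectively, and let $H$ be a $p \times p$ orthogonal matrix. Then

(34)
$$\max_{H} \operatorname{tr} HAH'B = \sum_{j=1}^{p} a_j b_j, \qquad \min_{H} \operatorname{tr} HAH'B = \sum_{j=1}^{p} a_j b_{p-j+1}.$$

Proof. Let $A = H_a D_a H_a'$ and $B = H_b D_b H_b'$, where $H_a$ and $H_b$ are orthogonal and $D_a$ and $D_b$ are diagonal with diagonal elements $a_1, \ldots, a_p$ and $b_1, \ldots, b_p$, respectively. Then

(35)
$$\max_{H^*} \operatorname{tr} H^* A H^{*\prime} B = \max_{H^*} \operatorname{tr} H^* H_a D_a H_a' H^{*\prime} H_b D_b H_b' = \max_{H^*} \operatorname{tr} (H_b' H^* H_a) D_a (H_b' H^* H_a)' D_b = \max_{H} \operatorname{tr} H D_a H' D_b,$$

where $H = H_b' H^* H_a$. We have

(36)
$$\operatorname{tr} H D_a H' D_b = \sum_{i=1}^{p} (H D_a H')_{ii} b_i = \sum_{i=1}^{p-1} (b_i - b_{i+1}) \sum_{j=1}^{i} (H D_a H')_{jj} + b_p \sum_{j=1}^{p} (H D_a H')_{jj} \le \sum_{i=1}^{p-1} (b_i - b_{i+1}) \sum_{j=1}^{i} a_j + b_p \sum_{j=1}^{p} a_j = \sum_{i=1}^{p} a_i b_i$$

by Lemma A.4.6 below. The minimum in (34) is treated as the negative of the maximum with $B$ replaced by $-B$ [von Neumann (1937)]. ■

Lemma A.4.6. Let $P = (p_{ij})$ be a doubly stochastic matrix ($p_{ij} \ge 0$, $\sum_i p_{ij} = 1$, $\sum_j p_{ij} = 1$). Let $y_1 \ge y_2 \ge \cdots \ge y_p$. Then

(37)
$$\sum_{i=1}^{k} y_i \ge \sum_{i=1}^{k} \sum_{j=1}^{p} p_{ij} y_j, \qquad k = 1, \ldots, p.$$

Proof.

(38)
$$\sum_{i=1}^{k} \sum_{j=1}^{p} p_{ij} y_j = \sum_{j=1}^{p} g_j y_j,$$

where $g_j = \sum_{i=1}^{k} p_{ij}$, $j = 1, \ldots, p$ ($0 \le g_j \le 1$, $\sum_{j=1}^{p} g_j = k$). Then

(39)
$$\sum_{j=1}^{p} g_j y_j - \sum_{j=1}^{k} y_j = -\sum_{j=1}^{k} y_j + y_k \left( k - \sum_{j=1}^{p} g_j \right) + \sum_{j=1}^{p} g_j y_j = \sum_{j=1}^{k} (y_j - y_k)(g_j - 1) + \sum_{j=k+1}^{p} (y_j - y_k) g_j \le 0. \qquad ■$$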


Corollary A.4.2. Let $A$ be a symmetric matrix with characteristic roots $a_1 \ge a_2 \ge \cdots \ge a_p$. Then

(40)
$$\max_{R'R = I_k} \operatorname{tr} R'AR = \sum_{i=1}^{k} a_i.$$

Proof. In Theorem A.4.7 let

(41)
$$B = \begin{pmatrix} I_k & 0 \\ 0 & 0 \end{pmatrix}. \qquad ■$$

Theorem A.4.8.

(42)
$$|I + xC| = 1 + x \operatorname{tr} C + O(x^2).$$

Proof. The determinant (42) is a polynomial in $x$ of degree $p$; the coefficient of the linear term is the first derivative of the determinant evaluated at $x = 0$. In Lemma A.4.5 let $n = 1$, $c_1 = x$, $\beta_{ih}(x) = \delta_{ih} + x c_{ih}$, where $\delta_{ii} = 1$ and $\delta_{ih} = 0$, $i \ne h$. Then $d\beta_{ih}(x)/dx = c_{ih}$, $B_{ii} = 1$ for $x = 0$, and $B_{ih} = 0$ for $x = 0$, $i \ne h$. Thus

(43)
$$\left. \frac{d}{dx} |I + xC| \right|_{x=0} = \sum_{i=1}^{p} c_{ii} = \operatorname{tr} C. \qquad ■$$

A.5. GRAM-SCHMIDT ORTHOGONALIZATION AND THE SOLUTION OF LINEAR EQUATIONS


A.5.1. Gram-Schmidt Orthogonalization 

The derivation of the Wishart density in Section 7.2 included the Gram-Schmidt orthogonalization of a set of vectors; we shall review that development here. Consider the $p$ linearly independent $n$-dimensional vectors $v_1, \ldots, v_p$ ($p \le n$). Define $w_1 = v_1$,

(1)
$$w_i = v_i - \sum_{j=1}^{i-1} \frac{v_i'w_j}{\|w_j\|^2} w_j, \qquad i = 2, \ldots, p.$$

Then $w_i \ne 0$, $i = 1, \ldots, p$, because $v_1, \ldots, v_p$ are linearly independent, and $w_i'w_j = 0$, $i \ne j$, as was proved by induction in Section 7.2. Let $u_i = (1/\|w_i\|)w_i$, $i = 1, \ldots, p$. Then $u_1, \ldots, u_p$ are orthonormal; that is, they are orthogonal and of unit length. Let $U = (u_1, \ldots, u_p)$. Then $U'U = I$. Define $t_{ii} = \|w_i\|$ ($> 0$),

(2)
$$t_{ij} = \frac{v_i'w_j}{\|w_j\|} = v_i'u_j, \qquad j = 1, \ldots, i - 1, \quad i = 2, \ldots, p,$$

and $t_{ij} = 0$, $j = i + 1, \ldots, p$, $i = 1, \ldots, p - 1$. Then $T = (t_{ij})$ is a lower triangular matrix. We can write (1) as

(3)
$$v_i = \|w_i\| u_i + \sum_{j=1}^{i-1} (v_i'u_j) u_j = \sum_{j=1}^{i} t_{ij} u_j, \qquad i = 1, \ldots, p,$$

that is,

(4)
$$V = (v_1, \ldots, v_p) = UT'.$$

Then

(5)
$$A = V'V = TU'UT' = TT'$$

as shown in Section 7.2. Note that if $V$ is square, we have decomposed an arbitrary nonsingular matrix into the product of an orthogonal matrix and an upper triangular matrix with positive diagonal elements; this is sometimes known as the QR decomposition. The matrices $U$ and $T$ in (4) are unique.


These operations can be done in a different order. Let $V = (v_1^{(0)}, \ldots, v_p^{(0)})$. For $k = 1, \ldots, p - 1$ define recursively

(6)
$$t_{kk} = \|v_k^{(k-1)}\|, \qquad u_k = \frac{1}{t_{kk}} v_k^{(k-1)},$$

(7)
$$t_{jk} = v_j^{(k-1)\prime} u_k, \qquad j = k + 1, \ldots, p,$$

(8)
$$v_j^{(k)} = v_j^{(k-1)} - t_{jk} u_k, \qquad j = k + 1, \ldots, p.$$

Finally $t_{pp} = \|v_p^{(p-1)}\|$ and $u_p = (1/t_{pp}) v_p^{(p-1)}$. The same orthonormal vectors $u_1, \ldots, u_p$ and the same triangular matrix $(t_{ij})$ are given by the two procedures.
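The two orderings of the Gram-Schmidt computation can be compared side by side. The following is a minimal sketch in plain Python, assuming the vectors are stored as lists of floats; the function names `gram_schmidt` and `modified_gram_schmidt` are illustrative labels, not names used in the text.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gram_schmidt(V):
    """First procedure: subtract from v_i its components along w_1, ..., w_{i-1}."""
    p = len(V)
    W, U = [], []
    T = [[0.0] * p for _ in range(p)]
    for i, v in enumerate(V):
        w = list(v)
        for j in range(i):
            coef = dot(v, W[j]) / dot(W[j], W[j])
            w = [wk - coef * wjk for wk, wjk in zip(w, W[j])]
        W.append(w)
        T[i][i] = math.sqrt(dot(w, w))          # t_ii = ||w_i||
        U.append([wk / T[i][i] for wk in w])
        for j in range(i):
            T[i][j] = dot(v, U[j])              # t_ij = v_i' u_j
    return U, T

def modified_gram_schmidt(V):
    """Second procedure: normalize v_k, then sweep u_k out of all later vectors."""
    p = len(V)
    work = [list(v) for v in V]
    U = []
    T = [[0.0] * p for _ in range(p)]
    for k in range(p):
        T[k][k] = math.sqrt(dot(work[k], work[k]))
        u = [x / T[k][k] for x in work[k]]
        U.append(u)
        for j in range(k + 1, p):
            T[j][k] = dot(work[j], u)
            work[j] = [x - T[j][k] * uk for x, uk in zip(work[j], u)]
    return U, T

V = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]   # illustrative vectors
U1, T1 = gram_schmidt(V)
U2, T2 = modified_gram_schmidt(V)

def close(a, b, tol=1e-12):
    return abs(a - b) < tol

same = all(close(U1[i][k], U2[i][k]) and close(T1[i][k], T2[i][k])
           for i in range(3) for k in range(3))
orthonormal = all(close(dot(U1[i], U1[j]), float(i == j))
                  for i in range(3) for j in range(3))
# v_i is recovered as the sum over j of t_ij u_j, i.e. V = U T'.
rebuilt = all(close(V[i][k], sum(T1[i][j] * U1[j][k] for j in range(3)))
              for i in range(3) for k in range(3))
print(same, orthonormal, rebuilt)
```

In exact arithmetic the two procedures agree; in floating point the second ordering is generally regarded as the better behaved of the two.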

The numbering of the columns of $V$ is arbitrary. For numerical stability it is usually best at any given stage to select the largest of $\|v_j^{(k-1)}\|$ to call $t_{kk}$.

Instead of constructing $w_i$ as orthogonal to $w_1, \ldots, w_{i-1}$, we can equivalently construct it as orthogonal to $v_1, \ldots, v_{i-1}$. Let $w_1 = v_1$ and define

(9)
$$w_i = v_i + \sum_{j=1}^{i-1} f_{ij} v_j$$

such that

(10)
$$0 = v_h'w_i = v_h'v_i + \sum_{j=1}^{i-1} f_{ij} v_h'v_j = a_{hi} + \sum_{j=1}^{i-1} a_{hj} f_{ij}, \qquad h = 1, \ldots, i - 1.$$

Let $F = (f_{ij})$, where $f_{ii} = 1$ and $f_{ij} = 0$, $i < j$. Then

(11)
$$W = (w_1, \ldots, w_p) = VF'.$$

Let $D_t$ be the diagonal matrix with $\|w_j\| = t_{jj}$ as the $j$th diagonal element. Then $U = WD_t^{-1} = VF'D_t^{-1}$. Comparison with $V = UT'$ shows that $F = D_t T^{-1}$. Since $A = TT'$, we see that $FA = D_t T'$ is upper triangular. Hence $F$ is the matrix defined in Theorem A.1.2.

There are other methods of accomplishing the QR decomposition that may be computationally more efficient or more stable. A Householder matrix has the form $H = I_n - 2\alpha\alpha'$, where $\alpha'\alpha = 1$, and is orthogonal and symmetric. Such a matrix $H_1$ (i.e., a vector $\alpha$) can be selected so that the first column of $H_1V$ has 0's in all positions except the first, which is positive. The next matrix has the form

(12)
$$H_2 = \begin{pmatrix} 1 & 0 \\ 0 & I_{n-1} - 2\alpha\alpha' \end{pmatrix}.$$

The $(n - 1)$-component vector $\alpha$ is chosen so that the second column of $H_2H_1V$ has all 0's except the first two components, the second being positive. This process is continued until

(13)
$$H_{p-1} \cdots H_2 H_1 V = \begin{pmatrix} T' \\ 0 \end{pmatrix},$$

where $T'$ is upper triangular and 0 is $(n - p) \times p$. Let

(14)
$$H = H_1 \cdots H_{p-1} = (H^{(1)} \;\; H^{(2)}),$$

where $H^{(1)}$ has $p$ columns. Then from (13) we obtain $V = H^{(1)}T'$. Since the decomposition is unique, $H^{(1)} = U$.
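The Householder reduction can be sketched in a few lines. This is an illustrative implementation in plain Python, assuming an $n \times p$ matrix stored as a list of rows; the function name is hypothetical. The loop runs over all $p$ columns, which is a choice made here so that the last column is also cleared when $n > p$. Because the reflections are orthogonal, the reduced matrix $R$ must satisfy $R'R = V'V$.

```python
import math

def householder_triangularize(V):
    """Reduce V (n x p, list of rows) to upper triangular form by reflections."""
    n, p = len(V), len(V[0])
    R = [row[:] for row in V]
    for k in range(p):
        x = [R[i][k] for i in range(k, n)]
        norm = math.sqrt(sum(v * v for v in x))
        a = x[:]
        a[0] -= norm                          # a is proportional to x - ||x|| e_1
        a_norm = math.sqrt(sum(v * v for v in a))
        if a_norm < 1e-14:                    # column already has the required form
            continue
        a = [v / a_norm for v in a]           # unit vector, a'a = 1
        for j in range(k, p):                 # apply I - 2aa' to rows k, ..., n-1
            proj = sum(a[i] * R[k + i][j] for i in range(n - k))
            for i in range(n - k):
                R[k + i][j] -= 2.0 * proj * a[i]
    return R

V = [[1.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0],
     [1.0, 1.0, 1.0]]                         # illustrative 4 x 3 matrix
R = householder_triangularize(V)

n, p = len(V), len(V[0])
below_zero = all(abs(R[i][j]) < 1e-12 for j in range(p) for i in range(j + 1, n))
diag_positive = all(R[j][j] > 0 for j in range(p))
gram_kept = all(
    abs(sum(R[i][a] * R[i][b] for i in range(n))
        - sum(V[i][a] * V[i][b] for i in range(n))) < 1e-12
    for a in range(p) for b in range(p))
print(below_zero, diag_positive, gram_kept)
```

The vector chosen here, proportional to $x - \|x\|e_1$, gives a positive diagonal element as the text requires; production code often uses the opposite sign to avoid cancellation and fixes the signs afterwards.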

Another procedure uses Givens matrices. A Givens matrix $G_{ij}$ is $I$ except for the elements $g_{ii} = \cos\theta = g_{jj}$ and $g_{ij} = \sin\theta = -g_{ji}$, $i \ne j$. It is orthogonal. Multiplication of $V$ on the left by such a matrix leaves all rows unchanged except the $i$th and $j$th; $\theta$ can be chosen so that the $i,j$th element of $G_{ij}V$ is 0. Givens matrices $G_{21}, \ldots, G_{n1}$ can be chosen in turn so that $G_{n1} \cdots G_{21}V$ has all 0's in the first column except the first element, which is positive. Next $G_{32}, \ldots, G_{n2}$ can be selected in turn so that when they are applied the resulting matrix has 0's in the second column except for the first two elements. Let

(15)
$$G' = G_{21}' \cdots G_{n1}' G_{32}' \cdots G_{n,p-1}' = (G^{(1)} \;\; G^{(2)}).$$

Then we obtain

(16)
$$V = G' \begin{pmatrix} T' \\ 0 \end{pmatrix} = G^{(1)}T',$$

and $G^{(1)} = U$.
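The Givens reduction can be sketched in the same spirit. The code below is a minimal illustration in plain Python, assuming an $n \times p$ matrix with $n > p$ stored as a list of rows; the function name is hypothetical. Each rotation acts on rows $i$ and $j$ only, with $\cos\theta$ and $\sin\theta$ chosen so that the $i,j$th element becomes 0 and the $j$th diagonal element becomes nonnegative.

```python
import math

def givens_triangularize(V):
    """Zero the elements below the diagonal of V, column by column, by rotations."""
    n, p = len(V), len(V[0])
    R = [row[:] for row in V]
    for j in range(p):
        for i in range(j + 1, n):
            r = math.hypot(R[j][j], R[i][j])
            if r == 0.0:
                continue
            c, s = R[j][j] / r, -R[i][j] / r     # cos(theta), sin(theta)
            # Row i becomes c * row_i + s * row_j; row j becomes -s * row_i + c * row_j.
            row_i = [c * R[i][k] + s * R[j][k] for k in range(p)]
            row_j = [-s * R[i][k] + c * R[j][k] for k in range(p)]
            R[i], R[j] = row_i, row_j
    return R

V = [[1.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0],
     [1.0, 1.0, 1.0]]                            # illustrative 4 x 3 matrix
R = givens_triangularize(V)

n, p = len(V), len(V[0])
below_zero = all(abs(R[i][j]) < 1e-12 for j in range(p) for i in range(j + 1, n))
diag_positive = all(R[j][j] > 0 for j in range(p))
gram_kept = all(
    abs(sum(R[i][a] * R[i][b] for i in range(n))
        - sum(V[i][a] * V[i][b] for i in range(n))) < 1e-12
    for a in range(p) for b in range(p))
print(below_zero, diag_positive, gram_kept)
```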

A.5.2. Solution of Linear Equations

In the computation of regression coefficients and other statistics, we need to solve linear equations

(17)
$$Ax = y,$$

where $A$ is $p \times p$ and positive definite. One method of solution is Gaussian elimination of variables, or pivotal condensation. In the proof of Theorem A.1.2 we constructed a lower triangular matrix $F$ with diagonal elements 1 such that $FA = A^*$ is upper triangular. If $Fy = y^*$, then the equation is

(18)
$$A^*x = y^*.$$

In coordinates this is

(19)
$$\sum_{j=i}^{p} a_{ij}^* x_j = y_i^*, \qquad i = 1, \ldots, p.$$

Let $a_{ij}^{**} = a_{ij}^*/a_{ii}^*$, $y_i^{**} = y_i^*/a_{ii}^*$, $j = i, i + 1, \ldots, p$, $i = 1, \ldots, p$. Then

(20)
$$x_i = y_i^{**} - \sum_{j=i+1}^{p} a_{ij}^{**} x_j;$$

these equations are to be solved successively for $x_p, x_{p-1}, \ldots, x_1$. The calculation of $FA = A^*$ is known as the forward solution, and the solution of (18) as the backward solution.

Since $FAF' = A^*F' = D^2$ diagonal, (20) is $A^{**}x = y^{**}$, where $A^{**} = D^{-2}A^*$ and $y^{**} = D^{-2}y^*$. Solving this equation gives

(21)
$$x = A^{**-1}y^{**} = F'y^{**}.$$

The computation is

(22)
$$x = F'D^{-2}Fy.$$

The multiplier of $y$ in (22) indicates a sequence of row operations which yields $A^{-1}$.

The operations of the forward solution transform $A$ to the upper triangular matrix $A^*$. As seen in Section A.5.1, the triangularization of a matrix can be done by a sequence of Householder transformations or by a sequence of Givens transformations.

From $FA = A^*$, we obtain

(23)
$$|A| = \prod_{i=1}^{p} a_{ii}^*,$$

which is the product of the diagonal elements of $A^*$, resulting from the forward solution. We also have

(24)
$$y'A^{-1}y = (Fy)'D^{-2}(Fy) = y^{*\prime}D^{-2}y^* = y^{*\prime}y^{**}.$$

The forward solution gives a computation for the quadratic form which occurs in $T^2$ and other statistics.
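The forward and backward solutions, together with the determinant and quadratic form they yield, can be sketched as follows. This is a minimal illustration in plain Python with exact rational arithmetic, assuming a small positive definite matrix chosen for the example; the function names are hypothetical, and no pivoting is done because the pivots of a positive definite matrix are positive.

```python
from fractions import Fraction as Fr

def forward_solution(A, y):
    """Row operations with unit multipliers on the diagonal: returns A* = FA and y* = Fy."""
    p = len(A)
    S = [[Fr(v) for v in row] for row in A]
    z = [Fr(v) for v in y]
    for k in range(p):
        for i in range(k + 1, p):
            f = S[i][k] / S[k][k]
            S[i] = [S[i][j] - f * S[k][j] for j in range(p)]
            z[i] -= f * z[k]
    return S, z

def backward_solution(S, z):
    """Solve the upper triangular system S x = z for x_p, x_{p-1}, ..., x_1."""
    p = len(S)
    x = [Fr(0)] * p
    for i in reversed(range(p)):
        x[i] = (z[i] - sum(S[i][j] * x[j] for j in range(i + 1, p))) / S[i][i]
    return x

A = [[4, 2, 2], [2, 5, 3], [2, 3, 6]]     # illustrative positive definite matrix
y = [2, 1, 3]

A_star, y_star = forward_solution(A, y)
x = backward_solution(A_star, y_star)

# |A| is the product of the diagonal elements of A*.
det_A = A_star[0][0] * A_star[1][1] * A_star[2][2]
# y'A^{-1}y is the sum of y*_i squared divided by a*_ii.
quad = sum(ys * ys / A_star[i][i] for i, ys in enumerate(y_star))

residual_ok = all(sum(A[i][j] * x[j] for j in range(3)) == y[i] for i in range(3))
quad_ok = quad == sum(yi * xi for yi, xi in zip(y, x))
print(x, det_A, quad)
```

The quadratic form is obtained from the forward solution alone; the backward solution is needed only when $x$ itself is wanted.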

For more on matrix computations consult Golub and Van Loan (1989).



APPENDIX B 


Tables 


TABLE B.1

Wilks' Likelihood Criterion: Factors C(p, m, M) to Adjust to $\chi^2_{pm}$, where M = n - p + 1

5% Significance Level

p = 3

M\m          2        4        6        8       10       12       14       16       18
1        1.295    1.422    1.535    1.632    1.716    1.791    1.857    1.916    1.971
2        1.109    1.174    1.241    1.302    1.359    1.410    1.458    1.501    1.542
3        1.058    1.099    1.145    1.190    1.232    1.272    1.309    1.344    1.377
4        1.036    1.065    1.099    1.133    1.167    1.199    1.229    1.258    1.286
5        1.025    1.046    1.072    1.100    1.127    1.154    1.179    1.204    1.228
6        1.018    1.035    1.056    1.078    1.101    1.123    1.145    1.167    1.188
7        1.014    1.027    1.044    1.063    1.082    1.101    1.121    1.139    1.158
8        1.011    1.022    1.036    1.052    1.068    1.085    1.102    1.119    1.135
9        1.009    1.018    1.030    1.043    1.058    1.073    1.088    1.102    1.117
10       1.007    1.015    1.025    1.037    1.050    1.063    1.076    1.089    1.103
12       1.005    1.011    1.019    1.028    1.038    1.048    1.059    1.070    1.081
15       1.003    1.008    1.013    1.020    1.027    1.035    1.043    1.052    1.060
20       1.002    1.004    1.008    1.012    1.017    1.022    1.028    1.034    1.040
30       1.001    1.002    1.004    1.006    1.009    1.011    1.015    1.018    1.021
60       1.000    1.001    1.001    1.002    1.002    1.003    1.004    1.006    1.007
∞        1.000    1.000    1.000    1.000    1.000    1.000    1.000    1.000    1.000
χ²_pm  12.5916  21.0261  28.8693  36.4150  43.7730  50.9985  58.1240  65.1708  72.1532

TABLE B.1 (Continued)

5% Significance Level

p            3        3        4        4        4        4        4        4        4
M\m         20       22        2        4        6        8       10       12       14
1        2.021    2.067    1.407    1.451    1.517    1.583    1.644    1.700    1.751
2        1.580    1.616    1.161    1.194    1.240    1.286    1.331    1.373    1.413
3        1.408    1.438    1.089    1.114    1.148    1.183    1.218    1.252    1.284
4        1.313    1.338    1.057    1.076    1.102    1.130    1.159    1.186    1.213
5        1.251    1.273    1.040    1.055    1.076    1.099    1.122    1.145    1.168
6        1.208    1.227    1.030    1.042    1.059    1.078    1.097    1.118    1.137
7        1.176    1.193    1.023    1.033    1.047    1.063    1.080    1.097    1.115
8        1.151    1.167    1.018    1.027    1.038    1.052    1.067    1.082    1.097
9        1.132    1.147    1.015    1.022    1.032    1.044    1.057    1.070    1.084
10       1.116    1.129    1.012    1.018    1.027    1.038    1.049    1.061    1.073
12       1.092    1.103    1.009    1.014    1.020    1.029    1.038    1.047    1.058
15       1.069    1.078    1.006    1.009    1.014    1.020    1.027    1.035    1.042
20       1.046    1.052    1.003    1.006    1.009    1.013    1.017    1.022    1.027
30       1.025    1.029    1.002    1.003    1.004    1.006        …    1.011    1.014
60       1.008    1.009    1.000    1.001    1.001    1.002    1.003    1.003    1.004
∞        1.000    1.000    1.000    1.000    1.000    1.000    1.000    1.000    1.000
χ²_pm  79.0819  85.9649  15.5073  26.2962  36.4150  46.1943  55.7585  65.1708  74.4683


TABLE B.1 (Continued)

5% Significance Level

p            4        4        4        5        5        5        5        5        5
M\m         16       18       20        2        4        6        8       10       12
1        1.799    1.843    1.884    1.503    1.483    1.514    1.556    1.600    1.643
2        1.450    1.485    1.518    1.209    1.216    1.245    1.280    1.315    1.350
3        1.314    1.343    1.371    1.120    1.130    1.154    1.182    1.211    1.240
4        1.239    1.264    1.288    1.079    1.089    1.108    1.131    1.155    1.179
5        1.190    1.212    1.233    1.056    1.065    1.081    1.100    1.120    1.141
6        1.157    1.176    1.194    1.042    1.050    1.063    1.079    1.097    1.114
7        1.132    1.149    1.165    1.033    1.040    1.051    1.065    1.080    1.095
8        1.113    1.128    1.143    1.026    1.032    1.042    1.054    1.067    1.081
9        1.098    1.111    1.125    1.022    1.027    1.035    1.046    1.057    1.070
10       1.086    1.098    1.110    1.018    1.023    1.030    1.039    1.050    1.061
12       1.068    1.078    1.088    1.013    1.017    1.023    1.030    1.038    1.047
15       1.050    1.058    1.066    1.009    1.011    1.016    1.021    1.028    1.034
20       1.033    1.039    1.045    1.005    1.007    1.010    1.013    1.018    1.022
30       1.018    1.021    1.024    1.002    1.003    1.005    1.007    1.009    1.012
60       1.005    1.007    1.008    1.001    1.001    1.001    1.002    1.003    1.004
∞        1.000    1.000    1.000    1.000    1.000    1.000    1.000    1.000    1.000
χ²_pm  83.6753  92.8083  101.879  18.3070  31.4104  43.7730  55.7585  67.5048  79.0819

TABLE B.1 (Continued)

5% Significance Level

p            5        5        6        6        6        6        6        7        7
M\m         14       16        2        6        8       10       12        2        4
1        1.683    1.722    1.587    1.520    1.543    1.573    1.605    1.662    1.550
2        1.383    1.415    1.254    1.255    1.279    1.307    1.335    1.297    1.263
3        1.267    1.294    1.150    1.163    1.184    1.208    1.232    1.178    1.165
4        1.203    1.226    1.100    1.116    1.134    1.154    1.175    1.121    1.116
5        1.161    1.181    1.072    1.088    1.103    1.120    1.138    1.089    1.087
6        1.132    1.150    1.055    1.069    1.082    1.097    1.113    1.068    1.068
7        1.111    1.127        …    1.056    1.068    1.081    1.095    1.054    1.055
8        1.095    1.109    1.035    1.046    1.057    1.068    1.081    1.044    1.045
9        1.082    1.095    1.029    1.039    1.048    1.059    1.070    1.036    1.038
10       1.072    1.083    1.024    1.034    1.042    1.051    1.061    1.031    1.032
12       1.057    1.066    1.018    1.025    1.032    1.040    1.048    1.023    1.024
15       1.042    1.049    1.012    1.018    1.023    1.029    1.035    1.016    1.017
20       1.027    1.033    1.007    1.011    1.014    1.018    1.023    1.010        …
30       1.014    1.018    1.003    1.006    1.007    1.010    1.012    1.005    1.005
60       1.004    1.006    1.001    1.002    1.002    1.003    1.004    1.001    1.001
∞        1.000    1.000    1.000    1.000    1.000    1.000    1.000    1.000    1.000
χ²_pm  90.5312  101.879  21.0261  50.9985  65.1708  79.0819  92.8083  23.6848  41.3371


TABLE B.1 (Continued)

5% Significance Level

p            7        7        7        8        8        9        9        9       10
M\m          6        8       10        2        8        2        4        6        2
1        1.530    1.538    1.557    1.729    1.538    1.791    1.614    1.558    1.847
2        1.266    1.282    1.303    1.336    1.288    1.373    1.309    1.293    1.408
3        1.173    1.189    1.208    1.206    1.195    1.232    1.201    1.196    1.257
4        1.124    1.139    1.155    1.142    1.144    1.162    1.144    1.144    1.182
5        1.095    1.108    1.122    1.105    1.113    1.121    1.110    1.112    1.137
6        1.075    1.086    1.099    1.081    1.091    1.094    1.088    1.090    1.107
7        1.062    1.071    1.083    1.065    1.076    1.076    1.071    1.074    1.087
8        1.051    1.060    1.070    1.053    1.064    1.062    1.060    1.062    1.072
9        1.043    1.051    1.060    1.044    1.055    1.052    1.050    1.053    1.061
10       1.037        …    1.053    1.038    1.048    1.045    1.043    1.046    1.052
12       1.029    1.034    1.042    1.028    1.038    1.034    1.033    1.035    1.039
15       1.020    1.024    1.031    1.019    1.027    1.023    1.023    1.025    1.028
20       1.013    1.016    1.019    1.012    1.017    1.014    1.015    1.016    1.017
30       1.006    1.008    1.010    1.006    1.009    1.007    1.007    1.008    1.009
60       1.002    1.002    1.003    1.001    1.003    1.002    1.002    1.002    1.002
∞        1.000    1.000    1.000    1.000    1.000    1.000    1.000    1.000    1.000
χ²_pm  58.1240  74.4683  90.5312  26.2962  83.6753  28.8693  50.9985  72.1532  31.4104

TABLE B.1 (Continued)

1% Significance Level

p = 3

M\m          2        4        6        8       10       12       14       16
1        1.356    1.514    1.649    1.763    1.862    1.949    2.026    2.095
2        1.131    1.207    1.282    1.350    1.413    1.470    1.523    1.571
3        1.070    1.116    1.167    1.216    1.262    1.306    1.346    1.384
4        1.043    1.076    1.113    1.150    1.187    1.221    1.254    1.285
5        1.030    1.054    1.082    1.112    1.141    1.170    1.198    1.224
6        1.022    1.040    1.063    1.087    1.112    1.136    1.159    1.182
7        1.016    1.031    1.050    1.070    1.091    1.111    1.132    1.152
8        1.013    1.025    1.041    1.058    1.075    1.093    1.111    1.129
9        1.010    1.021    1.034    1.048    1.064    1.080    1.095    1.111
10       1.009    1.017    1.028    1.041    1.055    1.069    1.082    1.097
12       1.006    1.012    1.021    1.031    1.042    1.053    1.064    1.076
15       1.004    1.009    1.014    1.021    1.030    1.038    1.047    1.056
20       1.002    1.005    1.009    1.013    1.019    1.024    1.030    1.036
30       1.001    1.002    1.004    1.007    1.009    1.012    1.016    1.019
60       1.000    1.001    1.001    1.002    1.003    1.004    1.005    1.006
∞        1.000    1.000    1.000    1.000    1.000    1.000    1.000    1.000
χ²_pm  16.8119  26.2170  34.8053  42.9798  50.8922  58.6192  66.2062  73.6826


TABLE B.1 (Continued)

1% Significance Level

p            3        3        3        4        4        4        4        4
M\m         18       20       22        2        4        6        8       10
1        2.158    2.216    2.269    1.490    1.550    1.628    1.704    1.774
2        1.616    1.657    1.696    1.192    1.229    1.279    1.330    1.379
3        1.420    1.453    1.485    1.106    1.132    1.168    1.207    1.244
4        1.315    1.344    1.371    1.068    1.088    1.115    1.146    1.176
5        1.249    1.274    1.297    1.047    1.063    1.085    1.109    1.134
6        1.204    1.226    1.246    1.035    1.048    1.066    1.086    1.107
7        1.171    1.190    1.209    1.027    1.037    1.052    1.070    1.088
8        1.146    1.163    1.180    1.021    1.030    1.043    1.053    1.073
9        1.127    1.142    1.157    1.017    1.025    1.036    1.048    1.062
10       1.111    1.125    1.139    1.014    1.021    1.030    1.041    1.054
12       1.087    1.099    1.110    1.010    1.015    1.023    1.031    1.041
15       1.065    1.074    1.083    1.007    1.010    1.016    1.022    1.029
20       1.043    1.049    1.056    1.004    1.006    1.010    1.014    1.019
30       1.023    1.027    1.031    1.002    1.003    1.005    1.007    1.009
60       1.007    1.009    1.010    1.000    1.001    1.001    1.002    1.003
∞        1.000    1.000    1.000    1.000    1.000    1.000    1.000    1.000
χ²_pm  81.0688  88.3794  95.6257  20.0902  31.9999  42.9798  53.4858  63.6907

TABLE B.1 (Continued)

1% Significance Level

p            4        4        4         4         4        5        5        5
M\m         12       14       16        18        20        2        4        6
1        1.838    1.896    1.949     1.999     2.045    1.606    1.589    1.625
2        1.424    1.467    1.507     1.545     1.580    1.248    1.253    1.284
3        1.280    1.314    1.347     1.378     1.408    1.141    1.150    1.175
4        1.205    1.234    1.261     1.287     1.313    1.092    1.101    1.121
5        1.159    1.183    1.207     1.230     1.252    1.065    1.074    1.090
6        1.128    1.149    1.169     1.189     1.208    1.049    1.056    1.070
7        1.106    1.124    1.142     1.160     1.177    1.038    1.044    1.056
8        1.089    1.105    1.121     1.137     1.153    1.031    1.036    1.046
9        1.076    1.091    1.105     1.119     1.133    1.025    1.030    1.039
10       1.066    1.079    1.092     1.105     1.118    1.021    1.025    1.033
12       1.051    1.062    1.073     1.083     1.094    1.015    1.019    1.025
15       1.037    1.045    1.053     1.062     1.071    1.010    1.013    1.017
20       1.024    1.029    1.035     1.041     1.047    1.006    1.008    1.011
30       1.012    1.015    1.019     1.022     1.026    1.003    1.004    1.005
60       1.004    1.005    1.006     1.007     1.008    1.001    1.001    1.001
∞        1.000    1.000    1.000     1.000     1.000    1.000    1.000    1.000
χ²_pm  73.6826  83.5134  93.2168  102.8168  112.3292  23.2093  37.5662  50.8922


TABLE B.1 (Continued)

1% Significance Level

p            5        5        5        5        5        6        6        6
M\m          8       10       12       14       16        2        6        8
1        1.672    1.721    1.768        …    1.855    1.707    1.631    1.656
2        1.321    1.359    1.396    1.431    1.465    1.300    1.294    1.319
3        1.204    1.235    1.265    1.294    1.323    1.175    1.183    1.205
4        1.145    1.171    1.196    1.221    1.245    1.116    1.129    1.148
5        1.110    1.131    1.153    1.174    1.196    1.084    1.097    1.113
6        1.087    1.105    1.124    1.143    1.161    1.063    1.076    1.090
7        1.071    1.087    1.103    1.119    1.136    1.050    1.061    1.074
8        1.059    1.073    1.087    1.102    1.116    1.040    1.051    1.062
9        1.050    1.062    1.075    1.088    1.101    1.033    1.043    1.052
10       1.043    1.054    1.065    1.077    1.089    1.028    1.037    1.045
12       1.033    1.041    1.051    1.060    1.070    1.021    1.028    1.035
15       1.023    1.030    1.037    1.044    1.052    1.014    1.020    1.024
20       1.015    1.019    1.024    1.029    1.034    1.008    1.012    1.015
30       1.007    1.010    1.012    1.015    1.019    1.004    1.006    1.008
60       1.002    1.003    1.004    1.005    1.006    1.001    1.002    1.002
∞        1.000    1.000    1.000    1.000    1.000    1.000    1.000    1.000
χ²_pm  63.6907  76.1539  88.3794  100.425  112.329  26.2170  58.6192  73.6826

TABLE B.1 (Continued)

1% Significance Level

p            6        6        7        7        7        7        7
M\m         10       12        2        4        6        8       10
1        1.687    1.722    1.797    1.667    1.642    1.648    1.666
2        1.348    1.378    1.348    1.305    1.306    1.321    1.342
3        1.230    1.255    1.207    1.188    1.194    1.210    1.229
4        1.169    1.191    1.140    1.130    1.138    1.152    1.169
5        1.131    1.150    1.102    1.097    1.105    1.117    1.132
6        1.106    1.122    1.078    1.076    1.083    1.094    1.107
7        1.087    1.102    1.062    1.061    1.067    1.077    1.089
8        1.074    1.086    1.050    1.050    1.056    1.065    1.075
9        1.063    1.075    1.042    1.042    1.047    1.055    1.065
10       1.055    1.065    1.035    1.036    1.041    1.048    1.056
12       1.042    1.051    1.026    1.027    1.031    1.037    1.044
15       1.030    1.037    1.018    1.019    1.022    1.025    1.032
20       1.019    1.024    1.011    1.012    1.014    1.017    1.020
30       1.010    1.013    1.005    1.006    1.007        …    1.011
60       1.003    1.004    1.001    1.002    1.002    1.003    1.003
∞        1.000    1.000    1.000    1.000    1.000    1.000    1.000
χ²_pm  88.3794  102.816  29.1412  48.2782  66.2062  83.5134  100.425


TABLE B. 1 ( Continued) 


1% Significance Level 


M\m 

P 

2 

8 

8 

2 

p - 9 

4 

6 

尸 m 10 

2 

1 

1.879 

1.646 

1.953 

1.740 

1.671 

2.021 

2 

1.394 

1.326 

1.436 

1.355 

1.333 

1.476 

3 

1.238 

1.215 

1.267 

1.226 

1.218 

1.296 

4 

U63 

1.158 

1.185 

U61 

1.158 

1.207 

5 

1.120 

1.123 

1.138 

1.122 

U22 

1.155 

6 

1,092 

1.099 

1.107 

.1.096 

1.098 

1.121 

7 

1,074 

1.082 

1.086 

1.078 

1.080 

1.098 

S 

1.060 

1.069 

1.070 

1.065 

1.067 

1.081 

9 

1,050 

1.059 

1.059 

1.055 

1.058 

L068 

10 

1.043 

1.051 

1.050 

1.047 

1.050 

1.058 

12 

1.032 

1.040 

1.038 

1.036 

1,037 

L044 

15 

1.022 

1.028 

1.026 

1.026 

1.027 

1.031 

20 

1.013 

1.018 

1.016 

1.016 

1.017 

1.019 

30 

1.007 

1.009 

1.008 

1.008 

1.009 

1.010 

60 

1.002 

1.003 

1.002 

1.002 

1.003 

1.003 

« 

1.000 

LOOO 

1,000 

1.000 

1,000 

1.000 

X.pm 

31.9999 

93.2168 

34,8053 

58.6192 

81.0688 

37 5662 
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TABLE B.2 

Tables of Significance Points for the Lawley-Hotelling Trace Test 

_ Pr OM~ a _ 

5% Significance Level 

p-2 


n\m 

2 

3 

4 

5 

6 

8 


12 

15 

2 

9.859* 

10.659* 

11.098* 

11.373* 

11.562* 

11.804* 

11.952* 

12.052* 

12.153* 

3 

58428 

58.915 

59.161 

59.308 

59.407 

59.531 

59.606 

59.655 

59.705 

4 

23.999 

23.312 

22.918 

22.663 

22.484 

22.250 

22.104 

22.003 

21.901 

5 

15.639 

14.864 

14.422 

14.135 

13.934 

13.670 

13.504 

13.391 

13.275 

6 

12.175 

11.411 

10.975 

10.691 

10,491 

10.228 

10.063 

9.949 

9.832 

7 

10.334 

9.594 

9.169 

8.893 

8.697 

8.440 

8.277 

8.164 

8.048 

8 

9.207 

8.488 

8.075 

7.805 

7.614 

7.361 

7.201 

7.090 

6.975 

10 

7.909 

7.224 

6.829 

6.570 

6.386 

6.141 

5 . 卯 4 

5.875 

5.761 

12 

7,190 

6.528 

6.146 

5.894 

5.715 

5.474 

5.320 

5.212 

5.100 

14 

6.735 

6.090 

5.717 

5.470 

5.294 

5.057 

4.905 

4.798 

4.686

18 

6.193 

5.571 

5.209 

4.970 

4.798 

4.566 

4.416 

4.309 

4.198 

20 

6.019 

5.405 

5.047

4.810 

4.640 

4.410 

4.260 

4.154 

4.042 

25 

5.724 

5.124 

4.774 

4.542 

4.374 

4.147 

3.998 

3.892

3.780 

30 

5.540 

4.949 

4.604 

4.374 

4.209 

3.983 

3.835 

3.729 

3.617 

35 

5.414

4.829

4.488

4.260 

4.096 

3.872 

3.724 

3.618

3.505

40 

5.322 

4.742

4.404

4.178

4.014 

3.791 

3.643 

3.538 

3.425

50 

5.198 

4.625 

4.290 

4.066 

3.904 

3.682 

3.535 

3.429 

3.315 

60 

5.118 

4.549 

4.217 

3.994 

3.833 

3.611 

3.465 

3.359 

3.245 

70 

5.062 

4.496 

4.165 

3.944 

3.783 

3.562 

3.416 

3.310 

3.196

80 

5.020 

4.457 

4.127 

3.907 

3.747 

3.526 

3.380 

3.274 

3.159 

100 

4.963 

4.403 

4.075 

3.856 

3.696 

3.476

3.330 

3.224 

3.109 

200 

4.851 

4.298 

3.974

3.757 

3.598 

3.380 

3.234 

3.127 

3.012 

∞

4.744 

4.197 

3.877 

3.661 

3.504 

3.287 

3.141 

3.035 

2.918 


*Multiply by 10^2.
TABLE B.2 (Continued)


5.754 

5.621 

5.529 

5.461 

5.369 

5.191 

5.023 


2.776†
2.992*
69.244
34.070
22.517 
17.191 
14.229 

11.120 

9.541

8.597 

7.533 

7.206 


2.844†

2.891†

2.952†

2.980†

3.014†

3.039†

2.994* 

2.995* 

2.996* 

2.997* 

2.997* 

2.998* 

68.116 

67.337 

66.332 

65.712 

65.290 

64.862 

33.121 

32.465 

31.615 

31.088 

30.729 

30.364 

21.706 

21.143 

20.413 

19.958 

19.648 

19.332 

16.469 

15.967 

15.313 

14.905 

14.626 

14.341 

13.567 

13.106 

12.504 

12.127 

11.868 

11.603 

10.531 

10.121 

9.582 

9.243 

9.01J 

8.769 

8.996 

8.615 

8.113 

7.796 

7.577 

7.351 

8.082 

7.720 

7.242 

6.939 

6.729 

6.511 

7.053 

6.714 

6.265 

5.979 

5.780 

5.572 

6.736 

6.406 

5.966 

5.685

5.489 

5.284 

6.217 

5.899 

5.476 

5.204 

5.013 

4.813

5.903 

5.593 

5.180 

4.914 

4.726 

4.529 

5.692 

5.389 

4.982 

4.720 

4.535 

4.339 

5.542 

5.243 

4.841 

4.582 

4.398 

4.204 

5.341 

5.048 

4.653 

4.397 

4.216 

4.023 

5.214 

4.924 

4.534 

4.280 

4.100 

3.908 

5.125 

4.838 

4.451 

4.199 

4.020 

3.829 

5.061 

4.775 

4.391 

4.140

3.961 

3.770 

4.972 

4.690 

4.308 

4.059 

3.881 

3.691 

4.803 

4.525 

4.150 

3.903 

3.727 

3.538 

4.642 

4.369 

4.000 

3.757 

3.582 

3.393 


2 

2.467†

2.667†

3 

2.985* 

2.990* 

4 

74.275 

71.026 

5 

38.295 

35.567 

6 

26.118 

23.794 

7 

20.388 

18.326 

8 

17.152 

15.268 

10 

13.701 

12.038 

12 

11.920

10.388 

14 

10.844 

9.399 

18 

9.617 

8.278 

20 

9.236 

7.932 

25 

8.604 

7.360 

30 

8.219 

7.013 

35 

7.959 

6.780 

40 

7.773

6.613 

50 

7.523 

6.389 

60 

7.363 

6.247 

70 

7.252 

6.148 

80 

7.171

6.075 

100 

7.059 

5.976 

200 

6.843 

5.785 

∞

6.638 

5.604 


†Multiply by 10^4.
*Multiply by 10^2.


1% Significance Level 
p = 2

2 3 4 5 6 8 10 12 15 


TABLE B.2 (Continued)


5% Significance Level

p = 3

n\m

3

4

5

6

8

10

12

15

20


3 

25.930*

26.996* 

27.665* 

28.125* 

28.712* 

29.073* 

29.316* 

29.561* 

29.809*

4 

1.188* 

1.193* 

1.196* 

i.m* 

1.200* 

1.202*

1.203*

1.204*

1.205*

5 

42.474 

41.764 

41.305 

40.983 

40.562 

40.300 

40.120 

39.937 

39.750 

6 

25.456 

24.715 

24.235 

23.899 

23.458 

23.182 

22.992 

22.799 

22.600 

7 

18.752 

18.056 

17.605 

17.288 

16.870 

16.608 

16.427 

16.241 

16.051 

8 

15.308 

14.657 

14.233 

13.934 

13.540 

13.290

13.118

12.941 

12.758 

10 

11.893

11.306 

10.921 

10.649 

10.287 

10.057 

9.897 

9.732 

9.560 

12 

10.229 

9.682 

9.323 

9.068 

8.727 

8.509 

8.357 

8.198

8.033 

14 

9.255

8.736 

8.394 

8.149 

7.822 

7.612 

7.465 

7.311 

7.150

16

8.618 

8.118 

7.788 

7.553 

7.236 

7.031 

6.887 

6.736

6.577 

18 

8.170 

7.685 

7.364 

7.135

6.825 

6.624 

6.483 

6.334 

6.177 

20 

7.838 

7.365 

7.051 

6.826 

6.522 

6.325 

6.185 

6.038 

5.882 

25 

7.294

6.841 

6.539 

6.323 

6.029 

5.837 

5.700 

5.556 

5.401 

30 

6.965 

6.524 

6.231 

6.020 

5.732

5.543 

5.409 

5.265 

5.112 

35 

6.745 

6.313 

6.025 

5.818 

5.534 

5.348 

5.214 

5.072 

4.919 

40 

6.588 

6.162 

5.878 

5.673 

5.393 

5.208 

5.076 

4.934 

4.781 

50 

6.377 

5.961 

5.682 

5.481 

5.205 

5.022 

4.891 

4.750 

4.597 

60 

6.243 

5.832 

5.558 

5.359 

5.086 

4.904 

4.774 

4.633 

4.480 

70 

6.150

5.744 

5.471 

5.274 

5.003 

4.823

4.693 

4.553 

4.399

80 

6.082 

5.679 

5.408

5.212 

4.943 

4.763

4.634 

4.493 

4.339 

100 

5.989 

5.590 

5.322 

5.128 

4.860 

4.682 

4.552 

4.413 

4.258 

200 

5.810 

5.419 

5.156 

4.965 

4.702 

4.525 

4.397 

4.257 

4.102 

∞

5.640 

5.256 

4.999 

4.812 

4.552 

4.377 

4.250 

4.110 

3.954 


*Multiply by 10^2.
TABLE B.2 (Continued)


1% Significance Level

p = 3

n\m

3

4

5

6

8

10

12

15

20

3 

6.484†

6.750†

6.917†

7.031†

7.178†

7.267†

7.328†

7.38†

7.451†

4 

5.990* 

5.995* 

5.998* 

6.000* 

6.002* 

6.003* 

6.005* 

6.006* 

6.007* 

5 

1.274* 

1.242* 

1.222* 

1.208* 

1.190*

1.179* 

1.172* 

1.164* 

1.156*

6 

59.507 

57.032 

55.462 

54.377

52.973 

52.102 

51.509

50.906 

50.292 

7 

37.994 

35.993 

34.721

33.840 

32.695 

31.984 

31.498

31.002 

30.496 

8 

28.308 

26.599 

25.511 

24.755 

23.771 

23.157 

22.737 

22.308 

21.868 

10 

19.737 

18.355

17.471 

16.855 

16.050 

15.544 

15.197 

14.840 

14.472 

12 

15.973 

14.765 

13.990 

13.448 

12.737 

12.288 

11.978 

11.659 

11.328 

14 

13.905 

12.803

12.096 

11.599 

10.945 

10.530 

10.243 

9.946 

9.638 

16 

12.610 

11.581 

10.918 

10.452 

9.836 

9.444 

9.172 

8.890 

8.596 

18 

11.729 

10.751 

10.120 

9.676 

9.087 

8.712 

8.450 

8.178 

7.893 

20 

11.091 

10.152 

9.545 

9.117 

8.549 

8.186 

7.932 

7.668 

7.390 

25 

10.075 

9.201 

8.634 

8.233 

7.699 

7.356 

7.115 

6.803 

6.596 

30 

9.479 

8.644 

8.102 

7.718 

7.205 

6.874 

6.641 

6.395 

6.135 

35 

9.087 

8.280 

7.755 

7.382 

6.883 

6.560 

6.332 

6.091 

5.834 

40 

8.811 

8.023 

7.511 

7.146 

6.650 

6.339 

6.115 

5.877 

5.623 

50 

8.448 

7.686 

7.189 

6.836

6.360

6.050 

5.831 

5.597 

5.346 

60 

8.220 

7.474 

6.988 

6.642

6.174 

5.870 

5.653 

5.422 

5.172 

70 

8.063 

7.329 

6.850

6.509 

6.047 

5.746 

5.531 

5.302

5.053 

80 

7.948 

7.224 

6.750 

6.412

5.955 

5.656 

5.443 

5.215 

4.967 

100 

7.793 

7.081 

6.614 

6.281 

5.830 

5.534 

5.323 

5.096 

4.850 

200 

7.498 

6.808 

6.356 

6.032 

5.593 

5.304 

5.096 

4.873 

4.627 

∞

7.222

6.554

6.116 

5.801 

5.373 

5.089 

4.885 

4.664 

4.419 


†Multiply by 10^4.
*Multiply by 10^2.
TABLE B.2 (Continued)


5% Significance Level 
p = 4 


n\m

4 

5 

6 

8 

10 

12 

15 

20 

25 

4 

49.964* 

51.204*

52.054* 

53.142*

53.808* 

54.258* 

54.71* 

55.17* 

55.46* 

5 

1.996* 

2.001* 

2.005* 

2.009* 

2.011* 

2.013* 

2.015* 

2.016*

2.017*

6

65.715

64.999 

64.497 

63.841 

63.432 

63.151 

62.866 

62.573 

62.396 

7 

37.343 

36.629 

36.129

35.474 

35.064 

34.782 

34.495 

34.200 

34.019 

8 

26.516 

25.868 

25.413 

24.814

24.437 

24.178 

23.912 

23.639 

23.471 

10 

17.875

17.326

16.938 

16.424 

16.098 

15.872 

15.640 

15.399 

15.250

12

14.338

13.848

13.500 

13.037 

12.741 

12.535 

12.321 

12.099 

11.961 

14 

12.455

12.002

11.680

11.248 

10.972 

10.778

10.577

10.366 

10.234 

16 

11.295

10.868 

10.563 

10.154 

9.890 

9.705

9.512

9.309

9.181 

18 

10.512 

10.104

9.812

9.419

9.165 

8.986 

8.798 

8.600 

8.475 

20 

9.950

9.556

9.274 

8.893 

8.645 

8.471

8.287

8.093

7.970

25 

9.059 

8.688 

8.422

8.062 

7.826 

7.659 

7.482 

7.293

7.173 

30 

8.538 

8.182 

7.927 

7.578 

7.350 

7.188 

7.015 

6.829 

6.710 

35 

8.197 

7.852 

7.603 

7.263 

7.040 

6.880 

6.710 

6.526 

6.408 

40 

7.957 

7.619 

7.375 

7.041 

6.821 

6.664 

6.495 

6.313 

6.195 

50 

7.640 

7.313 

7.075 

6.750 

6.535 

6.380 

6.214 

6.033 

5.916 

60 

7.442 

7.120

6.887 

6.568 

6.356 

6.203 

6.038 

5.858

5.740 

70 

7.305 

6.988 

6.758 

6.443 

6.232 

6.081 

5.917 

5.738

5.620 

80 

7.206 

6.892 

6.665 

6.351 

6.143 

5.992 

5.829 

5.650 

5.532 

100 

7.071 

6.762 

6.537 

6.228 

6.021 

5.872 

5.710 

5.531 

5.413 

200

6.814

6.514 

6.295 

5.993 

5.791 

5.644 

5.484

5.305 

5.186 

∞

6.574

6.282 

6.069 

5.774 

5.576 

5.431 

5.272 

5.094 

4.974 


*Multiply by 10^2.
TABLE B.2 (Continued)


1% Significance Level 


p = 4 


n\m 

4 

5 

6 

8

10

12

15

20

25

4 

12 娜 

12.80(^ 

13.012†

13.283†

13.449†

13.561†

13.67†

13.79†

13.87†

5 

9.999* 

10.004* 

10.008* 

10.012* 

10.014* 

10.016* 

10.018* 

10.02* 

10.02* 

6 

1.938* 

1.906* 

1.885* 

1.857* 

1.840* 

1.828* 

1.816* 

1.804* 

1.797* 

7 

85.053 

82.731 

81.125 

79.047 

77.759 

76.882 

75.989 

75.082 

74.522 

8 

51.991 

50.178 

48.921 

47.290 

46.276 

45.583 

44.877 

44.156 

43.715 

10 

29.789 

28.478 

27.566 

26.376 

25.632 

25.121 

24.597 

24.060 

23.731 

12 

21.965 

20.889 

20.138 

19.154 

18.534 

18.108

17.668 

17.215 

16.936 

14 

18.142

17.199 

16.539 

15.670 

15.121 

14.742 

14.349 

13.943 

13.691

16 

15.916 

15.059 

14.457 

13.662 

13.157 

12.807 

12.444 

12.066 

11.831 

18 

14.473 

13.674

13.112

12.368 

11.894 

11.564 

11.221

10.863 

10.639 

20 

13.466 

12.710 

12.177 

11.470 

11.018 

10.703 

10.374 

10.030 

9.814 

25 

11.924 

11.237 

10.751 

10.103 

9.687 

9.395 

9.089 

8.766 

8.562 

30 

11.055 

10.409 

9.951 

9.338

8.943 

8.665 

8.372 

8.060 

7.863 

35 

10.499

9.880 

9.440 

8.851 

8.470 

8.200 

7.915 

7.611 

7.418 

40 

10.114

9.514 

9.087 

8.514 

8.142 

7.879 

7.600 

7.301 

7.110 

50 

9.614 

9.040

8.631

8.079 

7.720 

7.465 

7.194 

6.902 

6.713 

60 

9.305 

8.747 

8.319 

7.811

7.460 

7.210 

6.943 

6.655 

6.468 

70 

9.095 

8.549

8.158 

7.630 

7.284 

7.037 

6.774 

6.488 

6.301 

80 

8.944 

8.405 

8.020 

7.498 

7.157 

6.912 

6.651 

6.367 

6.181 

100 

8.739 

8.211 

7.833 

7.321 

6.985 

6.744 

6.486 

6.204 

6.019 

200 

8.354 

7.848 

7.484 

6.990 

6.664 

6.429

6.176 

5.898 

5.714 

∞

8.000 

7.513 

7.163 

6.686 

6.369 

6.140 

5.892 

5.616 

5.432 


†Multiply by 10^4.
*Multiply by 10^2.
7.383

7.135 

6.967

6.845 


6.680 6.455

6.370 6.142 

6.084 5.850 


90.04

47.723 47.35 

24.740 24.422
17.647 17.365 
14.361 14.100 
12.499 12.250 
11.310 11.068 
10.488 10.252 

9.239 9.010 

8.539 8.314 

8.093 7.869 

7.783 7.561 


5 

81.991* 

83.352* 

85.093* 

86.160*

86.88*

— 

— 

6 

3.009* 

3.014*

3.020*

3.024*

3.027*

3.029*

3.032*

7 

93.762 

93.042 

92.102 

91.515 

91.113 

90.705 

90.29 

8 

51.339 

50.646 

49.739 

49.170 

48.780 

48.382 

47.973 

10 

27.667 

27.115 

26.387 

25.927 

25.610 

25.284 

24.947 

12 

20.169 

19.701 

19.079 

18.683 

18.409 

18.124 

17.830 

14 

16.643 

16.224 

15.666 

15.309 

15.059 

14.800 

14.530 

16 

14.624

14.239

13.722

13.389 

13.157 

12.914 

12.659 

18 

13.326 

12.963 

12.476 

12.161 

11.939 

11.708 

11.463 

20 

12.424 

12.078 

11.612 

11.310

11.097 

10.874 

10.637 

25 

11.046 

10.728 

10.297 

10.016 

9.817 

9.606 

9.381 

30 

10.270 

9.969 

9.559 

9.291 

9.099 

8.896 

8.679 

35 

9.774 

9.484 

9.088 

8.828 

8.642 

8.444 

8.230 

40 

9.429 

9.147 

8.761 

8.507 

8.325 

8.130 

7.919 

50 

8.982 

8.711 

8.339 

8.092 

7.915 

7.725 

7.518 

60 

8.706 

8.441 

8.077 

7.836 

7.662 

7.474 

7.269 

70 

8.517 

8.257

7.899 

7.661 

7.489 

7.304 

7.100 

80 

8.381

8.124

7.770 

7.535 

7.365 

7.181 

6.978 

100 

8.197 

7.945 

7.597 

7.365 

7.197 

7.014 

6.813

200 

7.850 

7.607 

7.271 

7.045 

6.881 

6.702 

6.503 

∞

7.531 

7.295 

6.970 

6.750 

6.590 

6.414 

6.217 


†Multiply by 10^4.
*Multiply by 10^2.


TABLE B.2 (Continued) 


5% Significance Level 


n\m


12 


15 


20 


25 


40 


TABLE B.2 (Continued)

1% Significance Level
p = 5


n\m 

5 

6 

8 

10 

12 

15 

20

25

40


5 

20.495* 

20.834* 

21.267* 

21.53* 

—

—

—

—

—

6 

15.014* 

15.019* 

15.025* 

15.029* 

15.033* 

15.03* 

15.06* 

—

— 

7 

2.735* 

2.704* 

2.665*

2.640*

2.623* 

2.606* 

2.590* 

2.579* 

— 

8 

1.150* 

1.128* 

1.099* 

1.081* 

1.069* 

1.057* 

1.044* 

1.036* 

—

10 

48.048 

46.670 

44.877 

43.758 

42.992 

42.210 

41.408 

40.921 

—

12 

31.108 

30.065 

28.701 

27.846 

27.257 

26.653 

26.031 

25.648 

25.06 

14 

24.016 

23.145 

22.001 

21.279 

20.781 

20.268 

19.736 

19.408 

18.90 

16 

20.240 

19.472 

18.459 

17.817 

17.373 

16.913 

16.435 

16.138 

15.678 

18 

17.929 

17.228 

16.302 

15.713 

15.304 

14.878 

14.435 

14.159 

13.727 

20 

16.380 

15.727 

14.862 

14.310 

13.925 

13.525 

13.105 

12.843 

12.431 

25 

14.107 

13.529 

12.759 

12.265 

11.918 

11.555 

11.172 

10.930 

10.547 

30

12.880 

12.345 

11.629 

11.167 

10.842 

10.500 

10.136 

9.906 

9.538 

35 

12.115 

11.607 

10.926 

10.486 

10.174 

9.845 

9.494 

9.271 

8.911 

40

11.593 

11.105 

10.448 

10.022 

9.720 

9.401 

9.058 

8.839 

8.484 

50 

10.928 

10.465 

9.841 

9.434 

9.144 

8.836 

8.504 

8.290 

7.940 

60 

10.523 

10.076

9.471 

9.076 

8.794 

8.493 

8.167 

7.956 

7.609 

70 

10.251 

9.814 

9.223 

8.835 

8.559 

8.263 

7.941 

7.732 

7.386 

80 

10.055 

9.626 

9.045 

8.663 

8.390 

8.097 

7.779 

7.571 

7.225 

100 

9.793 

9.374 

8.806 

8.432 

8.164 

7.876 

7.561 

7.355 

7.009 

200 

9.306 

8.907 

8.363 

8.004 

7.745 

7.465

7.157 

6.953 

6.606 

∞

8.863

8.482

7.961

7.615

7.365

7.093

6.790 

6.588 

6.236 


*Multiply by 10^2.
TABLE B.2 (Continued)


5% Significance Level 

p = 6


n\m 

6 

8 

10 

12 

15 

20 

25 

30 

35 

10 

45.722 

44.677 

44.019 

43.567 

43.103 

42.626 

42.334 

42.136 

41.993 

12 

28.959 

28.121 

27.590 

27.223 

26.843 

26.451 

26.209 

26.044 

25.925 

14 

22.321 

21.600 

21.141

20.821 

20.489 

20.144 

19.929 

19.783 

19.677 

16 

18.858 

18.210 

17.795 

17.505 

17.202 

16.886 

16.688 

16.553 

16.455 

18 

16.755 

16.157 

15.772 

15.501 

15.218 

14.921 

14.735 

14.607 

14.513 

20 

15.351 

14.788 

14.424 

14.168 

13.899 

13.615 

13.436 

13.313 

13.223 

25 

13.293 

12.786 

12.456 

12.222 

11.975 

11.711 

11.544 

11.428 

11.343 

30 

12.150 

11.705 

11.395 

11.173 

10.939 

10.687 

10.526 

10.414 

10.331

35 

11.484 

11.031 

10.733 

10.520 

10.293 

10.049 

9.892 

9.782 

9.700 

40 

11.009 

10.571 

10.282 

10.075 

9.853 

9.614 

9.460 

9.351 

9.270 

50 

10.402 

9.983 

9.706 

9.507 

9.293 

9.060 

8.908 

8.801 

8.721 

60 

10.031 

9.625 

9.355 

9.160 

8.951 

8.721 

8.572 

8.465 

8.385

70 

9.781 

9.383 

9.118 

8.927 

8.720 

8.494 

8.345 

8.239 

8.159

80 

9.601 

9.209 

8.948 

8.759 

8.555 

8.330 

8.182 

8.076 

7.996 

100 

9.360 

8.976 

8.720

8.534 

8.333 

8.110 

7.961

7.857 

7.777 

200 

8.910 

8.542 

8.295 

8.115

7.919 

7.701 

7.555 

7.449 

7.369

500 

8.659 

8.300 

8.059 

7.882 

7.689 

7.473 

7.328 

7.222 

7.140 

1000 

8.579

8.223 

7.983 

7.808 

7.616

7.400 

7.255 

7.149 

7.067 

∞

8.500

8.146

7.908

7.734

7.543

7.328 

7.183 

7.077 

6.994 




TABLE B.2 (Continued) 


1% Significance Level 

p = 6


n\m 

6 

8 

10 

12 

15 

20 

25 

30 

35 

10 

86.397 

83.565 

81.804 

80.602 

79.376 

78.124 

77.360 

76.845 

76.474 

12 

46.027 

44.103 

42.899 

42.073 

41.227 

40.359 

39.826 

39.466 

39.206 

14 

32.433 

30.918 

29.966 

29.309 

28.634 

27.936 

27.507 

27.215 

27.004 

16 

25.977 

24.689 

23.875 

23.311 

22.729 

22.126

21.753 

21.498 

21.314 

18 

22.292 

21.146 

20.418 

19.913 

19.389 

18.844 

18.505 

18.273 

18.105 

20 

19.935 

18.886 

18.217 

17.752 

17.267 

16.761

16.445 

16.229 

16.071 

25 

16.642 

15.737 

15.156 

14.749 

14.324 

13.875 

13.592 

13.397 

13.254 

30 

14.944 

14.118 

13.586 

13.211 

12.816 

12.398 

12.133 

11.949 

11.814 

35 

13.913 

13.138 

12.635 

12.281 

11.906 

11.506 

11.252 

11.074 

10.943 

40 

13.223 

12.482 

12.000 

11.659 

11.298 

10.911 

10.663 

10.490 

10.361 

50 

12.358 

11.661 

11.206 

10.882 

10.538 

10.167 

9.927 

9.759 

9.633 

60 

11.839 

11.169 

10.730 

10.417 

10.083 

9.721 

9.486 

9.320 

9.196 

70 

11.493 

10.841 

10.413 

10.107 

9.779 

9.424 

9.192 

9.028 

8.905 

80 

11.246 

10.607 

10.187 

9.886 

9.563 

9.212 

8.983 

8.819 

8.697 

100 

10.917 

10.295 

9.886 

9.592 

9.276 

8.930 

8.703 

8.541 

8.419 

200 

10.312 

9.723 

9.333 

9.052 

8.748 

8.412 

8.190 

8.030 

7.908 

500 

9.980 

9.409 

9.030 

8.755 

8.458 

8.128 

7.907 

7.747 

7.625 

1000 

9.874 

9.308 

8.933 

8.661 

8.365 

8.037 

7.817 

7.657 

7.534 

∞

9.770 

9.210 

8.838 

8.568 

8.274 

7.948 

7.728 

7.568 

7.446 
TABLE B.2 (Continued)



5% Significance Level

p = 7

n\m

8

10

12

15

20

25

30

35

10 

85.040 

84.082 

83.426 

82.755 

82.068 

81.648 

81.364 

81.159 

12 

42.850 

42.126 

41.627 

41.113

40.583 

40.257 

40.037 

39.877 

14 

29.968 

29.373 

28.961

28.534 

28.091 

27.817 

27.631 

27.493 

16 

24.038 

23.519 

23.158 

22.781 

22.389 

22.145 

21.978 

21.857 

18 

20.692 

20.222 

19.893 

19.549

19.189 

18.964 

18.809 

18.696

20 

18.561 

18.125 

17.819 

17.498 

17.159 

16.947 

16.800 

16.694 

25 

15.587 

15.202 

14.930 

14.642 

14.337 

14.143 

14.009 

13.911

30 

14.049 

13.693 

13.440 

13.172 

12.884 

12.701 

12.573 

12.478 

35 

13.113 

12.776 

12.535 

12.278 

12.002 

11.825 

11.700 

11.608 

40 

12.485 

12.160 

11.927 

11.679 

11.411 

11.237 

11.115 

11.025 

50 

11.695 

11.386 

11.165 

10.927 

10.668 

10.500 

10.381 

10.292 

60 

11.219 

10.921 

10.706 

10.475

10.221

10.056 

9.938 

9.850

70 

10.901 

10.610 

10.400

10.173

9.923 

9.760 

9.643 

9.555 

80 

10.674 

10.388 

10.181 

9.957 

9.710 

9.548 

9.432

9.344 

100 

10.371 

10.091 

9.889 

9.669 

9.426 

9.265 

9.150 

9.062 

200 

9.812 

9.545 

9.350 

9.138 

8.902 

8.744 

8.629 

8.542 

500 

9.504 

9.244 

9.054 

8.846 

8.613 

8.456 

8.342 

8.254 

1000 

9.405 

9.148 

8.959 

8.753 

8.521 

8.365 

8.250 

8.162 

∞

9.308 

9.053 

8.866 

8.661 

8.431 

8.275 

8.160 

8.072 



TABLE B.2 (Continued) 



1% Significance Level

p = 7

n\m

8

10

12

15

20

25

30

35

10 

185.93 

182.94 

180.90 

178.83 

176.73 

175.44 

174.57

173.92

12 

71.731 

69.978 

68.779 

67.552 

66.296 

65.528 

65.010 

64.636 

14 

44.255 

42.978 

42.099 

41.197 

40.269 

39.698 

39.311 

39.032 

16 

33.097 

32.057 

31.339 

30.599 

29.834 

29.361 

29.039 

28.806 

18 

27.273 

26.374 

25.750 

25.105 

24.435 

24.019 

23.735 

23.529 

20 

23.757 

22.949

22.388 

21.804 

21.195 

20.816 

20.556 

20.367 

25 

19.117 

18.440 

17.965 

17.469 

16.947 

16.619 

16.392 

16.227 

30 

16.848 

16.239 

15.810 

15.360 

14.882 

14.580 

14.370 

14.216

35 

15.512 

14.945 

14.544 

14.121 

13.670 

13.383 

13.183 

13.036 

40 

14.634 

14.095 

13.713 

13.309 

12.876 

12.599 

12.405 

12.262 

50 

13.553 

13.049 

12.691 

12.310 

11.899 

11.634 

11.448 

11.309 

60 

12.914 

12.432 

12.088 

11.720 

11.323 

11.065 

10.882 

10.746 

70 

12.492 

12.024 

11.690 

11.332 

10.942 

10.689 

10.509 

10.374 

80 

12.193 

11.736 

11.408 

11.056 

10.673 

10.422 

10.244 

10.110 

100 

11.797 

11.353 

11.034 

10.691 

10.316 

10.070 

9.894 

9.761 

200 

11.077 

10.658 

10.356 

10.028 

9.667 

9.427 

9.254 

9.123

500 

10.685 

10.230 

9.987 

9.668 

9.314 

9.078 

8.906 

8.774 

1000 

10.561 

10.160 

9.869 

9.553 

9.202 

8.966 

8.795 

8.663 

∞

10.439

10.043 

9.755 

9.441 

9.092 

8.857 

8.686

8.555 
TABLE B.2 (Continued)
5% Significance Level 

p = 8 


n\m 

8 

10 

12 

15 

20 

25 

30 

35 

14 

42.516 

41.737 

41.198

40.641 

40.066 

39.711 

39.470 

39.296 

16 

31.894 

31.242 

30.788 

30.318 

29.829 

29.525 

29.318 

29.167 

18 

26.421 

25.847

25.446 

25.028 

24.591 

24.319 

24.132 

23.996 

20 

23.127 

22.605 

22.239

21.856 

21.454 

21.201

21.028 

20.902 

25 

18.770

18.324 

18.009 

17.677 

17.325

17.102 

16.947 

16.834 

30 

16.626 

16.221 

15.934 

15.629 

15.303 

15.095 

14.950 

14.843 

35 

15.356 

14.977 

14.707

14.418 

14.109 

13.910 

13.771 

13.668 

40 

14.518 

14.156 

13.898 

13.621 

13.322 

13.129 

12.994 

12.893 

50 

13.482 

13.142 

12.898 

12.636

12.351 

12.165 

12.034 

11.936 

60 

12.866 

12.540 

12.305 

12.051 

11.774 

11.593 

11.465 

11.368

70 

12.459 

12.142 

11.912

11.665

11.393

11.215 

11.088 

10.992 

80 

12.169 

11.858 

11.634

11.390 

11.122

10.946 

10.820 

10.725 

100 

11.785 

11.483 

11.264

11.026 

10.763 

10.590 

10.465 

10.370 

200 

11.084 

10.798

10.589 

10.362 

10.108 

9.939

9.816 

9.722 

500 

10.701 

10.423 

10.221 

9.999 

9.751 

9.584 

9.461 

9.367

1000 

10.579 

10.304 

10.104 

9.884 

9.637 

9.470 

9.348 

9.254 

∞

10.459 

10.188 

9.989 

9.771 

9.526 

9.360 

9.238 

9.144 
TABLE B.2 (Continued)


1% Significance Level

p = 8

n\m

8

10

12

15

20

25

30

35

14 

65.793 

64.035 

62.828 

61.592 

60.323

59.545 

59.019 

58.639 

16 

44.977 

43.633 

42.707 

41.754 

40.771

40.164 

39.753 

39.456 

18 

35.265 

34.146 

33.373

32.573 

31.745 

31.232 

30.882 

30.629 

20 

29.786 

28.808 

28.129 

27.425 

26.691 

26.235 

25.924 

25.697 

25 

23.001 

22.212 

21.661 

21.085 

20.480 

20.100

19.838 

19.647 

30 

19.867 

19.173 

18.686 

18.173 

17.631 

17.288 

17.051

16.876 

35 

18.077 

17.440 

16.991 

16.516 

16.011 

15.690 

15.466 

15.301

40 

16.924 

16.324 

15.900 

15.451 

14.970 

14.662

14.447

14.288

50 

15.528 

14.975 

14.582

14.163 

13.711 

13.420

13.216 

13.063 

60 

14.715 

14.190

13.815

13.414 

12.980 

12.698

12.499 

12.351 

70 

14.184 

13.677

13.313 

12.925 

12.502 

12.226 

12.031 

11.885 

80 

13.810

13.315 

12.960 

12.580 

12.165

11.894

11.701 

11.556 

100 

13.317 

12.839 

12.496 
TABLE B.2 (Continued)
1% Significance Level

p = 10
TABLE B.3 (Continued)


2.722 
3.1 尨 
3.73i 
4.173
4.501 

4.665 

5.270 

5.723 
5.970 
6,m 

6.513 

6.612 


17 

19 

23 

27 

31 

37 

47 

67 

87 

127 

247 

∞


17 

19 

23 

27

31

37

47

67

87

127

247

∞


10.895 iQ.37\ 9.973 9.627 9.395 
U.OW 10.448 吡 031 9.721 9.479 
1U3U tO,A55 10.207 9.876 9.619 
U.SiO 10.615 10.344 9.999 9.731 
10.9<J 10.455 10.090 9.821 

11.645 11.092 iO.W i0.7iS 9.930 
I2.0i6 〖 1.2 辟 10.742 10.357 TO.041 
12.304 11.482 10.932 10 .边 10.224 
12.449 U.605 U.043 10.635 10.321 
12.607 U.7<5 IU68 10.751 10. <51 

12.762 11.897 11.306 10.883 10.557 
12.977 12.070 1L446 U.034 10.703 


9.W 9.049 8.915 
9,233 9.121 6.9© 
9. M2 9.240 9.095 
9.515 9.337 9.186 
9.599 9.416 9.261 


9.700 9.5U 
9.624 9.628 

9. m 9.776 
UJ.070 9.6A5 
10.176 9. W 

10.298 10. »5 9.906 

10. <59 1 0.223 10.043 


9.351 

9.463 

9.605 

9,691 

9,790 


B. <57 8.345 
B.4&4 8.390 
B, 如 8.464 
B.626 8.523 
B.676 8.572 

B.737 8.d30 
B.BII 8.701 
8,902 8,789 
& 95i 8.W2 
8,903 

9.089 0.972 
9.170 9.053 


[Table B.3 (Continued), p = 5: entries not recoverable from the source scan.]









TABLE B.3 (Continued)


3 4 5 6 7 8 9 10 15 20


13*027 12.603 12.3U 12.093 11.922 U.7B2 11.666 11.564 11.222 U.m 

13.147 12.700 12393 12.164 U.W U MO 1L719 1L616 \\ t 26Q ILW5 

13.340 11W7 i2‘526 m 12.091 U.W M. 808 11.699 1K325 U,l00 

I3w4fi7 12,770 12 631 12.375 12. [76 L2.0L5 1L66L k 1,76a U.300 lU 鉍 

13*603 13.075 12.716 12.451 12.245 12 079 M.94I ll‘OW )L4» IUW 

13.737 t3.L66 IZ616 12.541 12^328 i7AU L2.QU U.B93 IV 4$2 U •议 

13*695 11323 12.93A 42.A51 12.430 12.251 12」04 H.W U.SSS 1L301 

1(063 13.466 13.083 12.786 12.SS^ 12.37i 12.218 12.069 I1.&50 U.368 

UJ91 13.561 13.169 12、 SM H632 \2.UZ IZ 266 1Z 156 U.710 U.443 

14.310 13.687 13.266 11957 12.716 \7.S76 12.366 12.234 11.781 W.SW 

U.443 13.806 U376 13.061 12.818 12.622 12.4AI 12.325 U.847 U.594 

14.591 11940 13.501 13.180 \2^23 12.735 11572 12*434 U.972 U*700 


U.384 13.698 13.256 IZ980 12.7W 12.S37 IZ371 12.228 M.733 U.432 

U.476 13.649 13.413 13.066 12.832 12.624 12.449 12.301 U.7B9 U.478 

14.780 14.096 13.621 13.260 12*??3 n.769 12.SB4 12.426 11.885 U.559 

15.029 1 4-290 13.7W (3.4^3 13. i23 I2.d68 12,693 12.52g U.9^5 U*428 

15.222 14. U7 13.920 13.531 13.230 \7.9S6 12.765 12.614 12.034 \\.667 

15.44« 14.632 14.060 13. 674 13.359 13,106 12.897 IZ720 12. U9 IL76I 

15.718 14.855 U274 13.64« 13.519 13.2SS 13.037 12.852 12.229 U.558 

16.(M5 15.130 U.516 U.067 13.722 13.446 13.21A 13.024 1 2.374 U.989 

U.23A 15.292 \4.M U. 199 13.S45 i3.U\ 13.327 13,130 12. 4^ 12.074 

U.4S0 15*475 1 4.824 U.3S0 13 .他 13.695 13.454 13.254 12577 i2.177 

16.690 IS,M3 15.012 U.S25 14^£2 1 3.853 13.608 13.402 12.712 12-307 

16.964 15.923 15,231 U.730 U Z46 U.041 11791 U.Sdl 12.661 12.472 


U.999 12.992 IZ043 tk.463 U.OM 10.I0.52t 10.328 10.167 10*030 9.£58 9.275 

15*463 1X235 12,215 U.598 U. 174 1 0.857 10.610 1 0.409 10.241 10.099 9,612 9.321 

iL 177 13.620 12.491 U.819 U.360 11.021 10.757 10.543 \Q 4 M 10.216 9*704 久 399 

16.700 iX 910 12 703 U.99i U.507 tL til 10.874 10.^51 10,467 ^310 9.7 抝 9,4^5 

17-100 14,137 12.871 11126 U.«6 ll‘2W 10.970 10.740 10'550 10.38? 9,844 9.521 

17.54$ U.398 13.047 1 2.290 H.7W 11.363 11.OW 10.W6 10.451 10 . 你 5 9.924 9.591 

ta058 14.702 13.297 )7.4B2 \^9J5 M.536 H.227 ia980 »0.776 10.60^ 10.024 9.631 

LaM) t5.058 13.573 ^7\S ^.143 IL725 11.403 It. 06 10.934 LD.7S6 10.1S5 9.801 

LB,962 \S.2S\ H733 a&S3 12 泌 U.838 H.509 比 247 H.030 ia&49 10.238 9.B77 

19.310 15.484 13.909 13.005 12403 H.9A5 U •& 30 M.3^2 U. 142 10.956 10.335 9.970 

19.664 15.72$ UAOi \3 \77 12.S59 12.112 比 769 H.49£ IL271 U.Dd3 10.453 tO.083 
20.090 16.000 14.337 13.371 12738 1Z280 U.930 U.6S1 11.424 M.233 10.597 10.227 


[Table B.3 (Continued): entries not recoverable from the source scan.]








2.812

2. M7
2.972
3.019
3.049

3.123

3.181


1.9 飞 9 
1.973 
2.059 
2.124
2.176

2.234
2.303
2.383
2.429
2.47?

2 ‘议

2.596


3.634
3.767
3.&S9
3.927

4.001
4.081
4.169
4.216
4.265

4.317

4.371 


3. MO 
3.742 

3.937

4.074
4.176

4.267

4.W

4.543

4.616

4.692

4.772


2.978 
3.096
3.W
3.422
3.526

3.647
3, 7 & 4

3.940
4.077

4.120

4.221 

4.331 


1.752
1.796

1.915

1.954

1.9 时
2.049
2.109
2.142
2.176

2.217
2.261 


2.157
2.211
2.293
2.352
2.396

2.445

2.498

2.558

2.591

2.636

2.663
2.702



p=3 


TABLE B.4 

Tables of Significance Points for the Roy Maximum Root Test 


Pr{((m + n)/m) R ≥ x_α} = α


1 2 3 4 5 6 7 8 9 10 15 20


TABLE B.4 (Continued)

TABLE B.5

Significance Points for the Modified Likelihood Ratio Test of
Equality of Covariance Matrices Based on Equal Sample Sizes
Pr{-2 log λ* ≥ x} = 0.05




10 


n 

12 

13 


           2       3       4       5       6       7       8       9      10

p = 2

       12.18   18.70   24.55   30.09   35.45   40.68   45.81   50.87   55.87
       10.70   16.65   22.00   27.07   31.97   36.76   41.45   46.07   50.64
        9.97   15.63   20.73   25.56   30.23   34.79   39.26   43.67   48.02
        9.53   15.02   19.97   24.66   29.19   33.61   37.95   42.22   46.45
        9.24   14.62   19.46   24.05   28.49   32.82   37.07   41.26   45.40
        9.04   14.33   19.10   23.62   27.99   32.26   36.45   40.57   44.65
        8.88   14.11   18.83   23.30   27.62   31.84   35.98   40.06   44.08
        8.76   13.94   18.61   23.05   27.33   31.51   35.61   36.65   43.64

p = 3

       19.2    30.5    41.0    51.0    60.7    70.3    79.7    89.0    98.3
       17.57   28.24   38.06   47.49   56.68   65.69   74.58   83.37   92.09
       16.59   26.84   36.29   45.37   54.21   62.89   71.45   79.91   88.29
       15.93   25.90   35.10   43.93   52.54   60.99   69.33   77.56   85.72
       15.46   25.22   34.24   42.90   51.34   59.62   67.79   75.86   83.86
       15.11   24.71   33.59   42.11   50.42   58.58   66.62   74.57   82.45
       14.83   24.31   33.08   41.50   49.71   57.76   65.71   73.56   81.35
       14.61   23.99   32.67   41.01   49.13   57.11   64.97   72.75   80.46
       14.43   23.73   32.33   40.60   48.66   56.57   64.37   72.08   79.72

p = 4

       30.07   48.63   65.91   82.6    98.9   115.0   131.0     —       —
       27.31   44.69   60.90   76.56   91.89  107.0   121.9   137.0   152.0
       25.61   42.24   57.77   72.78   87.46  101.9   116.2   130.4   144.6
       24.46   40.56   55.62   70.17   84.42   98.45  112.3   126.1   139.8
       23.62   39.34   54.05   68.27   82.19   95.91  109.5   122.9   136.3


11     22.98   38.41   52.85   66.81   80.49   93.95  107.3   120.5   133.6
12     22.48   37.67   51.90   65.66   79.14   92.41  105.5   118.5   131.5
13     22.08   37.08   51.13   64.73   78.04   91.16  104.1   117.0   129.7
14     21.75   36.59   50.50   63.96   77.14   90.12  103.0   115.7   128.3
15     21.47   36.17   49.97   63.31   76.38   89.25  102.0   114.6   127.1





TABLE B.5 (Continued)

p = 5

           2       3       4       5       6       7

 8     39.29   65.15   89.46  113.0      —       —
 9     36.70   61.40   84.63  107.2   129.3   151.5
10     34.92   58.79   81.25  103.1   124.5   145.7
11     33.62   56.86   78.76  100.0   120.9   141.6
12     32.62   55.37   76.83   97.68  118.2   138.4
13     31.83   54.19   75.30   95.81  116.0   135.9
14     31.19   53.24   74.06   94.29  114.2   133.8
15     30.66   52.44   73.02   93.03  112.7   132.1
16     30.21   51.77   72.14   91.95  111.4   130.6

p = 6

           2       3       4       5

10     49.95   84.43  117.0      —
11     47.43   80.69  112.2   142.9
12     45.56   77.90  108.6   138.4
13     44.11   75.74  105.7   135.0
14     42.96   74.01  103.5   132.2
15     42.03   72.59  101.6   129.9
16     41.25   71.41  100.1   128.0
17     40.59   70.41   98.75  126.4
18     40.02   69.55   97.63  125.0
19     39.53   68.80   96.64  123.8
20     39.11   68.14   95.78  122.7


TABLE B.6

Correction Factors for Significance Points for the Sphericity Test

5% Significance Level

n\p        3       4       5       6       7       8

  4    1.217
  5    1.074   1.322
  6    1.038   1.122   1.383
  7    1.023   1.066   1.155   1.420
  8    1.015   1.041   1.088   1.180   1.442
  9    1.011   1.029   1.057   1.098   1.199   1.455
 10    1.008   1.021   1.040   1.071   1.121   1.214
 12    1.005   1.013   1.023   1.039   1.060   1.093
 14    1.004   1.008   1.015   1.024   1.037   1.054
 16    1.003   1.006   1.011   1.017   1.025   1.035
 18    1.002   1.005   1.008   1.012   1.018   1.025
 20    1.002   1.004   1.006   1.010   1.014   1.019
 24    1.001   1.002   1.004   1.006   1.009   1.012
 28    1.001   1.002   1.003   1.004   1.006   1.008
 34    1.000   1.001   1.002   1.003   1.004   1.005

42 

1.000 

1.000 

1,001 

1.001 

1.002 

1,002 

1.003 

50 

1.000 

1.001 

1 删 
L 000 

1,002 

1,002 

100

1.000 

1.000 

1.000 

1.000 

1.000 

 χ²   11.0705 16.9190 23.6848 31.4104 40.1133 49.8018



TABLE B.6 (Continued)

1% Significance Level

n\p        3       4       5       6       7       8

  4    1.266
  5    1.091   1.396
  6    1.046   1.148   1.471
  7    1.028   1.079   1.186   1.511
  8    1.019   1.049   1.103   1.213   1.542
  9    1.013   1.034   1.061   1.123   1.234   1.556
 10    1.010   1.025   1.047   1.081   1.138   1.250
 12    1.006   1.015   1.027   1.044   1.068   1.104
 14    1.004   1.010   1.018   1.028   1.041   1.060
 16    1.003   1.007   1.012   1.019   1.028   1.039
 18    1.002   1.005   1.009   1.014   1.020   1.028
 20    1.002   1.004   1.007   1.011   1.015   1.021
 24    1.001   1.003   1.005   1.007   1.010   1.013
 28    1.001   1.002   1.003   1.005   1.007   1.009
 34    1.001   1.001   1.002   1.003   1.004   1.006
 42    1.000   1.001   1.001   1.002   1.003   1.003
 50    1.000   1.001   1.001   1.001   1.002   1.002
100    1.000   1.000   1.000   1.000   1.000   1.001

 χ²   15.0863 21.6660 29.1412 37.5662 46.9629 57.3421


TABLE B.7†

Significance Points for the Modified Likelihood Ratio Test Σ = Σ₀
Pr{-2 log λ* ≥ x} = 0.05

p = 2

  n       5%       1%
  2    13.50    19.95
  3    10.64    15.56
  4     9.69    14.13
  5     9.22    13.42
  6     8.94    13.00
  7     8.75    12.73
  8     8.62    12.53
  9     8.52    12.38
 10     8.44    12.26

p = 3

  n       5%       1%
  4    18.8     25.6
  5    16.82    22.68
  6    15.81    21.23
  7    15.19    20.36
  8    14.77    19.78
  9    14.47    19.36
 10    14.24    19.04
 11    14.06    18.80
 12    13.92    18.61
 13    13.80    18.45
 14    13.70    18.31
 15    13.62    18.20

p = 4

  n       5%       1%
  7    25.8     30.8
  8    24.06    29.33
  9    23.00    28.36
 10    22.28    27.66
 11    21.75    27.13
 12    21.35    26.71
 13    21.03    26.38
 14    20.77    26.10
 15    20.56    25.87

p = 5

  n       5%       1%
  9    32.5     40.0
 10    31.4     38.6
 11    30.55    37.51
 12    29.92    36.72
 13    29.42    36.09
 14    29.02    35.57
 15    28.68    35.15
 16    28.40    34.79
 17    28.15    34.49
 18    27.94    34.23
 19    27.76    34.00
 20    27.60    33.79

p = 6

  n       5%       1%
 12    40.9     49.0
 13    40.0     47.8
 14    39.3     47.0
 15    38.7     46.2
 16    38.22    45.65
 17    37.81    45.13
 18    37.45    44.70
 19    37.14    44.32
 20    36.87    43.99
 21    36.63    43.69
 22    36.41    43.43
 24    36.05    42.99
 26    35.75    42.63
 28    35.49    42.32
 30    35.28    42.07

p = 7

  n       5%       1%
 18    48.6     56.9
 19    48.2     56.3
 20    47.7     55.8
 21    47.34    55.36
 22    47.00    54.96
 24    46.43    54.28
 26    45.97    53.73
 28    45.58    53.27
 30    45.25    52.88
 32    44.97    52.55
 34    44.73    52.27

p = 8

  n       5%       1%
 24    58.4     67.1
 26    57.7     66.3
 28    57.09    65.68
 30    56.61    65.12
 32    56.20    64.64
 34    55.84    64.23
 36    55.54    63.87
 38    55.26    63.55
 40    55.03    63.28

p = 9

  n       5%       1%
 28    70.1     79.6
 30    69.4     78.8
 32    68.8     78.17
 34    68.34    77.60
 36   (67.91)  (77.08)
 38   (67.53)  (76.65)
 40    67.21    76.29
 45    66.54    75.51
 50    66.02    74.92
 55    65.61    74.44
 60    65.28    74.06

p = 10

  n       5%       1%
 34   (82.3)   (92.4)
 36    81.7     91.8
 38    81.2     91.2
 40    80.7     90.7
 45    79.83    89.63
 50    79.13    88.83
 55    78.57    88.20
 60    78.13    87.68
 65    77.75    87.26
 70    77.44    86.89
 75    77.18    86.59


† Entries in parentheses have been interpolated or extrapolated into Korin's table.
p = number of variates; N = number of observations; n = N - 1.
-2 log λ* = n log|Σ₀| - np - n log|S| + n tr(SΣ₀⁻¹), where S is the sample covariance matrix.
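
As an illustration of the footnote to Table B.7, the following Python sketch evaluates the statistic n log|Σ₀| - np - n log|S| + n tr(SΣ₀⁻¹) for small matrices. It is a minimal sketch under stated assumptions, not code from the book: the function names `_eliminate` and `modified_lr_statistic` are hypothetical, n is taken to be N - 1, and both matrices are assumed to be symmetric positive definite.

```python
import math


def _eliminate(a, b):
    """Gaussian elimination with partial pivoting (hypothetical helper).

    Returns (log|det a|, sign of det a, x), where x solves a x = b
    and b is given as a matrix (list of rows).
    """
    p = len(a)
    m = len(b[0])
    a = [list(map(float, row)) for row in a]  # work on copies
    b = [list(map(float, row)) for row in b]
    logdet, sign = 0.0, 1.0
    for k in range(p):
        piv = max(range(k, p), key=lambda i: abs(a[i][k]))
        if a[piv][k] == 0.0:
            raise ValueError("matrix is singular")
        if piv != k:  # a row swap flips the sign of the determinant
            a[k], a[piv] = a[piv], a[k]
            b[k], b[piv] = b[piv], b[k]
            sign = -sign
        if a[k][k] < 0:
            sign = -sign
        logdet += math.log(abs(a[k][k]))
        for i in range(k + 1, p):
            f = a[i][k] / a[k][k]
            for j in range(k, p):
                a[i][j] -= f * a[k][j]
            for j in range(m):
                b[i][j] -= f * b[k][j]
    # Back substitution on the upper-triangular system.
    x = [[0.0] * m for _ in range(p)]
    for i in reversed(range(p)):
        for j in range(m):
            s = b[i][j] - sum(a[i][t] * x[t][j] for t in range(i + 1, p))
            x[i][j] = s / a[i][i]
    return logdet, sign, x


def modified_lr_statistic(S, Sigma0, n):
    """n log|Sigma0| - n p - n log|S| + n tr(S Sigma0^{-1}).

    Assumes n = N - 1 and that S and Sigma0 are positive definite.
    """
    p = len(S)
    identity = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    # x = Sigma0^{-1} S has the same trace as S Sigma0^{-1}.
    logdet0, sign0, x = _eliminate(Sigma0, S)
    logdet_s, sign_s, _ = _eliminate(S, identity)
    if sign0 <= 0 or sign_s <= 0:
        raise ValueError("both determinants must be positive")
    trace = sum(x[i][i] for i in range(p))
    return n * (logdet0 - p - logdet_s + trace)


if __name__ == "__main__":
    S = [[2.0, 0.0], [0.0, 0.5]]
    Sigma0 = [[1.0, 0.0], [0.0, 1.0]]
    print(modified_lr_statistic(S, Sigma0, n=10))
```

With the illustrative inputs above (p = 2, n = 10), the statistic is 10(0 - 2 - 0 + 2.5) = 5.0, which is below the 5% point 8.44 that Table B.7 lists for p = 2 and n = 10, so the hypothesis would not be rejected at that level.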




References 


At the end of each reference in brackets is a list of sections in which that
reference is used.

Abramowitz, Milton, and Irene Stegun (1972), Handbook of Mathematical Functions
with Formulas, Graphs, and Mathematical Tables, National Bureau of Standards,
U.S. Government Printing Office, Washington, D.C. [8.5]

Abruzzi, Adam (1950), Experimental Procedures and Criteria for Estimating and
Evaluating Industrial Productivity, doctoral dissertation, Columbia University
Library. [9.7, 9.P]

Adrian, Robert (1808), Research concerning the probabilities of the errors which
happen in making observations, etc., The Analyst or Mathematical Museum, 1,
93-109. [1.2]

Ahn, S. K., and G. C. Reinsel (1988), Nested reduced-rank autoregressive models for
multiple time series, Journal of American Statistical Association, 83, 849-856.
[12.7]

Aitken, A. C. (1937), Studies in practical mathematics, II. The evaluation of the latent
roots and latent vectors of a matrix, Proceedings of the Royal Society of Edinburgh,
57, 269-305. [11.4]

Amemiya, Yasuo, and T. W. Anderson (1990), Asymptotic chi-square tests for a large
class of factor analysis models, Annals of Statistics, 18, 1453-1463. [14.6]

Anderson, R. L., and T. A. Bancroft (1952), Statistical Theory in Research,
McGraw-Hill, New York. [8.P]

Anderson, T. W. (1946a), The non-central Wishart distribution and certain problems
of multivariate statistics, Annals of Mathematical Statistics, 17, 409-431.
(Correction, 35 (1964), 923-924.) [14.4]

Anderson, T. W. (1946b), Analysis of multivariate variance, unpublished. [10.6]


An Introduction to Multivariate Statistical Analysis, Third Edition. By T. W. Anderson 
ISBN 0-471-36091-0 Copyright © 2003 John Wiley & Sons, Inc.




Anderson, T. W. (1950), Estimation of the parameters of a single equation by the
limited-information maximum-likelihood method, Statistical Inference in Dynamic
Economic Models (Tjalling C. Koopmans, ed.), John Wiley & Sons, Inc., New
York. [12.8]

Anderson, T. W. (1951a), Classification by multivariate analysis, Psychometrika, 16,
31-50. [6.5, 6.9]

Anderson, T. W. (1951b), Estimating linear restrictions on regression coefficients for
multivariate normal distributions, Annals of Mathematical Statistics, 22, 327-351.
(Correction, Annals of Statistics, 8 (1980), 1400.) [12.6, 12.7]

Anderson, T. W. (1951c), The asymptotic distribution of certain characteristic roots
and vectors, Proceedings of the Second Berkeley Symposium on Mathematical
Statistics and Probability (Jerzy Neyman, ed.), University of California, Berkeley,
105-130. [12.7]

Anderson, T. W. (1955a), Some statistical problems in relating experimental data to
predicting performance of a production process, Journal of the American Statistical
Association, 50, 163-177. [8.P]

Anderson, T. W. (1955b), The integral of a symmetric unimodal function over a
symmetric convex set and some probability inequalities, Proceedings of the American
Mathematical Society, 6, 170-176. [8.10]

Anderson, T. W. (1957), Maximum likelihood estimates for a multivariate normal
distribution when some observations are missing, Journal of the American Statistical
Association, 52, 676-687. [4.P]

Anderson, T. W. (1963a), Asymptotic theory for principal component analysis, Annals
of Mathematical Statistics, 34, 122-148. [11.3, 11.6, 11.7, 13.5]

Anderson, T. W. (1963b), A test for equality of means when covariance matrices are
unequal, Annals of Mathematical Statistics, 34, 671-672. [5.P]

Anderson, T. W. (1965a), Some optimum confidence bounds for roots of determinantal
equations, Annals of Mathematical Statistics, 36, 468-488. [11.6]

Anderson, T. W. (1965b), Some properties of confidence regions and tests of parameters
in multivariate distributions (with discussion), Proceedings of the IBM Scientific
Computing Symposium in Statistics, October 21-23, 1963, IBM Data Processing
Division, White Plains, New York, 15-28. [11.6]

Anderson, T. W. (1969), Statistical inference for covariance matrices with linear
structure, Multivariate Analysis II (P. R. Krishnaiah, ed.), Academic, New York,
55-66. [3.P]

Anderson, T. W. (1971), The Statistical Analysis of Time Series, John Wiley & Sons,
Inc., New York. [8.11]

Anderson, T. W. (1973a), Asymptotically efficient estimation of covariance matrices
with linear structure, Annals of Statistics, 1, 135-141. [14.3]

Anderson, T. W. (1973b), An asymptotic expansion of the distribution of the Studentized
classification statistic W, Annals of Statistics, 1, 964-972. [6.6]

Anderson, T. W. (1973c), Asymptotic evaluation of the probabilities of misclassification
by linear discriminant functions, Discriminant Analysis and Applications
(T. Cacoullos, ed.), Academic, New York, 17-35. [6.6]





Anderson, T. W. (1974), An asymptotic expansion of the distribution of the limited
information maximum likelihood estimate of a coefficient in a simultaneous
equation system, Journal of the American Statistical Association, 69, 565-573.
(Correction, 71 (1976), 1010.) [12.7]

Anderson, T. W. (1976), Estimation of linear functional relationships: approximate
distributions and connections with simultaneous equations in econometrics (with
discussion), Journal of the Royal Statistical Society B, 38, 1-36. [12.7]

Anderson, T. W. (1977), Asymptotic expansions of the distributions of estimates in
simultaneous equations for alternative parameter sequences, Econometrica, 45,
509-518. [12.7]

Anderson, T. W. (1984a), Estimating linear statistical relationships, Annals of
Statistics, 12, 1-45. [10.6, 12.6, 12.7, 14.1]

Anderson, T. W. (1984b), Asymptotic distribution of an estimator of linear functional
relationships, unpublished. [12.7]

Anderson, T. W. (1987), Multivariate linear relations, Proceedings of the Second
International Tampere Conference in Statistics (Tarmo Pukkila and Simo Puntanen,
eds.), Tampere, Finland, 9-36. [12.7]

Anderson, T. W. (1989a), The asymptotic distribution of the likelihood ratio criterion
for testing rank in multivariate components of variance, Journal of Multivariate
Analysis, 30, 12-19. [10.6, 12.7]

Anderson, T. W. (1989b), The asymptotic distribution of characteristic roots and
vectors in multivariate components of variance, Contributions to Probability and
Statistics: Essays in Honor of Ingram Olkin (Leon Jay Gleser, Michael D. Perlman,
S. James Press, and Allan R. Sampson, eds.), Springer-Verlag, New York,
177-196. [13.6]

Anderson, T. W. (1989c), Linear latent variable models and covariance structures,
Journal of Econometrics, 41, 91-119. [Correction, 43 (1990), 395.] [12.8]

Anderson, T. W. (1993), Nonnormal multivariate distributions: Inference based on
elliptically contoured distributions, Multivariate Analysis: Future Directions (C. R.
Rao, ed.), North-Holland, Amsterdam, 1-25. [3.6]

Anderson, T. W. (1994), Inference in linear models, Proceedings of the International
Symposium on Multivariate Analysis and Its Applications (T. W. Anderson, K. T.
Fang, and I. Olkin, eds.), Institute of Mathematical Statistics, 1-20. [12.7, 12.8,
14.3]

Anderson, T. W. (1999a), Asymptotic theory for canonical correlation analysis, Journal
of Multivariate Analysis, 70, 1-29. [13.7]

Anderson, T. W. (1999b), Asymptotic distribution of the reduced rank regression
estimator under general conditions, Annals of Statistics, 27, 1141-1154. [13.7]

Anderson, T. W. (2002), Specification and misspecification in reduced rank regression,
Sankhya, 64A, 1-13. [12.7]

Anderson, T. W., and Yasuo Amemiya (1988a), The asymptotic normal distribution of
estimators in factor analysis under general conditions, Annals of Statistics, 16,
759-771. [14.6]





Anderson, T. W., and Yasuo Amemiya (1988b), Asymptotic distributions in factor
analysis and linear structural relations, Proceedings of the International Conference
on Advances in Multivariate Statistical Analysis (S. Das Gupta and J. K. Ghosh,
eds.), Indian Statistical Institute, Calcutta, 1-22. [14.6]

Anderson, T. W., and R. R. Bahadur (1962), Classification into two multivariate
normal distributions with different covariance matrices, Annals of Mathematical
Statistics, 33, 420-431. [6.10]

Anderson, T. W., and Kai-Tai Fang (1990a), On the theory of multivariate elliptically
contoured distributions and their applications, Statistical Inference in Elliptically
Contoured and Related Distributions (Kai-Tai Fang and T. W. Anderson, eds.),
Allerton Press, Inc., New York, 1-23. [4.5]

Anderson, T. W., and Kai-Tai Fang (1990b), Inference in multivariate elliptically
contoured distributions based on maximum likelihood, Statistical Inference in
Elliptically Contoured and Related Distributions (Kai-Tai Fang and T. W. Anderson,
eds.), Allerton Press, Inc., New York, 201-216. [3.6]

Anderson, T. W., and Kai-Tai Fang (1992), Theory and applications of elliptically
contoured and related distributions, The Development of Statistics: Recent
Contributions from China (X. R. Chen, K. T. Fang, and C. C. Yang, eds.), Longman
Scientific and Technical, Harlow, Essex, 41-62. [10.11]

Anderson, T. W., Kai-Tai Fang, and Huang Hsu (1986), Maximum likelihood estimates
and likelihood-ratio criteria for multivariate elliptically contoured distributions,
Canadian Journal of Statistics, 14, 55-59. [Reprinted in Statistical Inference
in Elliptically Contoured and Related Distributions (Kai-Tai Fang and T. W.
Anderson, eds.), Allerton Press, Inc., New York, 1990, 217-223.] [3.6]

Anderson, T. W., Somesh Das Gupta, and George P. H. Styan (1972), A Bibliography
of Multivariate Statistical Analysis, Oliver & Boyd, Edinburgh. (Reprinted by
Robert E. Krieger, Malabar, Florida, 1977.) [Preface]

Anderson, T. W., Naoto Kunitomo, and Takamitsu Sawa (1983a), Evaluation of the
distribution function of the limited information maximum likelihood estimator,
Econometrica, 50, 1009-1027. [12.7]

Anderson, T. W., Naoto Kunitomo, and Takamitsu Sawa (1983b), Comparison of the
densities of the TSLS and LIMLK estimators, Global Econometrics, Essays in
Honor of Lawrence R. Klein (F. Gerard Adams and Bert Hickman, eds.), MIT,
Cambridge, MA, 103-124. [12.7]

Anderson, T. W., and I. Olkin (1985), Maximum likelihood estimation of the parameters
of a multivariate normal distribution, Linear Algebra and Its Applications, 70,
147-171. [3.2]

Anderson, T. W., and Michael D. Perlman (1987), Consistency of invariant tests for
the multivariate analysis of variance, Proceedings of the Second International
Tampere Conference in Statistics (Tarmo Pukkila and Simo Puntanen, eds.),
Tampere, Finland, 225-243. [8.10]

Anderson, T. W., and Michael D. Perlman (1993), Parameter consistency of invariant
tests for MANOVA and related multivariate hypotheses, Statistics and Probability:
A Raghu Raj Bahadur Festschrift (J. K. Ghosh, S. K. Mitra, K. R. Parthasarathy,
and B. L. S. Prakasa Rao, eds.), Wiley Eastern Limited, 37-62. [8.10]





Anderson, T. W., and Herman Rubin (1949), Estimation of the parameters of a single
equation in a complete system of stochastic equations, Annals of Mathematical
Statistics, 20, 46-63. [Reprinted in Readings in Econometric Theory (J. Malcolm
Dowling and Fred R. Glahe, eds.), Colorado Associated University, 1970,
358-375.] [12.8]

Anderson, T. W., and Herman Rubin (1950), The asymptotic properties of estimates
of the parameters of a single equation in a complete system of stochastic
equations, Annals of Mathematical Statistics, 21, 570-582. [Reprinted in Readings
in Econometric Theory (J. Malcolm Dowling and Fred R. Glahe, eds.), Colorado
Associated University, 1970, 376-388.] [12.8]

Anderson, T. W., and Herman Rubin (1956), Statistical inference in factor analysis,
Proceedings of the Third Berkeley Symposium on Mathematical Statistics and
Probability (Jerzy Neyman, ed.), Vol. V, University of California, Berkeley and Los
Angeles, 111-150. [14.2, 14.3, 14.4, 14.6]

Anderson, T. W., and Takamitsu Sawa (1973), Distributions of estimates of coefficients
of a single equation in a simultaneous system and their asymptotic
expansions, Econometrica, 41, 683-714. [12.7]

Anderson, T. W., and Takamitsu Sawa (1977), Tables of the distribution of the
maximum likelihood estimate of the slope coefficient and approximations, Technical
Report No. 234, Economics Series, Institute for Mathematical Studies in
the Social Sciences, Stanford University, April. [12.7]

Anderson, T. W., and Takamitsu Sawa (1979), Evaluation of the distribution function
of the two-stage least squares estimate, Econometrica, 47, 163-182. [12.7]

Anderson, T. W., and Takamitsu Sawa (1982), Exact and approximate distributions of
the maximum likelihood estimator of a slope coefficient, Journal of the Royal
Statistical Society B, 44, 52-62. [12.7]

Anderson, T. W., and George P. H. Styan (1982), Cochran’s theorem, rank additivity
and tripotent matrices, Statistics and Probability: Essays in Honor of C. R. Rao
(G. Kallianpur, P. R. Krishnaiah, and J. K. Ghosh, eds.), North-Holland, Amsterdam,
1-23. [7.4]

Anderson, T. W., and Akimichi Takemura (1982), A new proof of admissibility of
tests in multivariate analysis, Journal of Multivariate Analysis, 12, 457-468. [8.10]

Andersson, Steen A., David Madigan, and Michael D. Perlman (2001), Alternative
Markov properties for chain graphs, Scandinavian Journal of Statistics, 28, 33-85.
[15.2]

Barnard, M. M. (1935), The secular variations of skull characters in four series of
Egyptian skulls, Annals of Eugenics, 6, 352-371. [8.8]

Barndorff-Nielson, O. E. (1978), Information and Exponential Families in Statistical
Theory, John Wiley & Sons, New York. [15.5]

Barnes, E. W. (1899), The theory of the gamma function, Messenger of Mathematics,
29, 64-129. [8.5]

Bartlett, M. S. (1934), The vector representation of a sample, Proceedings of the
Cambridge Philosophical Society, 30, 327-340. [8.3]

Bartlett, M. S. (1937a), Properties of sufficiency and statistical tests, Proceedings of the
Royal Society of London A, 160, 268-282. [10.2]





Bartlett, M. S. (1937b), The statistical conception of mental factors, British Journal of
Psychology, 28, 97-104. [14.7]

Bartlett, M. S. (1938), Further aspects of the theory of multiple regression, Proceedings
of the Cambridge Philosophical Society, 34, 33-40. [14.7]

Bartlett, M. S. (1939), A note on tests of significance in multivariate analysis,
Proceedings of the Cambridge Philosophical Society, 35, 180-185. [7.2, 8.6]

Bartlett, M. S. (1947), Multivariate analysis, Journal of the Royal Statistical Society,
Supplement, 9, 176-197. [8.8]

Bartlett, M. S. (1950), Tests of significance in factor analysis, British Journal of
Psychology (Statistics Section), 3, 77-85. [14.3]

Basmann, R. L. (1957), A generalized classical method of linear estimation of
coefficients in a structural equation, Econometrica, 25, 77-83. [12.8]

Basmann, R. L. (1961), A note on the exact finite sample frequency functions of
generalized classical linear estimators in two leading overidentified cases, Journal
of the American Statistical Association, 56, 619-636. [12.8]

Basmann, R. L. (1963), A note on the exact finite sample frequency functions of
generalized classical linear estimators in a leading three-equation case, Journal of
the American Statistical Association, 58, 161-171. [12.8]

Bennett, B. M. (1951), Note on a solution of the generalized Behrens-Fisher
problem, Annals of the Institute of Statistical Mathematics, 2, 87-90. [5.5]

Berger, J. O. (1975), Minimax estimation of location vectors for a wide class of
densities, Annals of Statistics, 3, 1318-1328. [3.5]

Berger, J. O. (1976), Admissibility results for generalized Bayes estimators of coordinates
of a location vector, Annals of Statistics, 4, 334-356. [3.5]

Berger, J. O. (1980a), A robust generalized Bayes estimator and confidence region for
a multivariate normal mean, Annals of Statistics, 8, 716-761. [3.5]

Berger, J. O. (1980b), Statistical Decision Theory, Foundations, Concepts and Methods,
Springer-Verlag, New York. [3.4, 6.2, 6.7]

Berndt, Ernst R., and Eugene Savin (1977), Conflict among criteria for testing
hypotheses in the multivariate linear regression model, Econometrica, 45,
1263-1277. [8.6]

Bhattacharya, P. K. (1966), Estimating the mean of a multivariate normal population
with general quadratic loss function, Annals of Mathematical Statistics, 37,
1819-1824. [3.5]

Birnbaum, Allan (1955), Characterizations of complete classes of tests of some
multiparametric hypotheses, with applications to likelihood ratio tests, Annals of
Mathematical Statistics, 26, 21-36. [5.6, 8.10]

Bjorck, A., and G. Golub (1973), Numerical methods for computing angles between
linear subspaces, Mathematics of Computation, 27, 579-594. [12.3]

Blackwell, David, and M. A. Girshick (1954), Theory of Games and Statistical Decisions,
John Wiley & Sons, New York. [6.2, 6.7]

Blalock, H. M., Jr. (ed.) (1971), Causal Models in the Social Sciences, Aldine-Atherton,
Chicago. [15.1]

Bonnesen, T., and W. Fenchel (1948), Theorie der Konvexen Körper, Chelsea, New
York. [8.10]





Bose, R. C. (1936a), On the exact distribution and moment-coefficients of the
D²-statistic, Sankhya, 2, 143-154. [3.3]

Bose, R. C. (1936b), A note on the distribution of differences in mean values of two
samples drawn from two multivariate normally distributed populations, and the
definition of the D²-statistic, Sankhya, 2, 379-384. [3.3]

Bose, R. C., and S. N. Roy (1938), The distribution of the studentised D²-statistic,
Sankhya, 4, 19-38. [5.4]

Bowker, A. H. (1960), A representation of Hotelling’s T² and Anderson's classification
statistic W in terms of simple statistics, Contributions to Probability and
Statistics (I. Olkin, S. G. Ghurye, W. Hoeffding, W. G. Madow, and H. B. Mann,
eds.), Stanford University, Stanford, California, 142-149. [5.2]

Bowker, A. H., and R. Sitgreaves (1961), An asymptotic expansion for the distribution
function of the W-classification statistic, Studies in Item Analysis and Prediction
(Herbert Solomon, ed.), Stanford University, Stanford, California, 293-310. [6.6]

Box, G. E. P. (1949), A general distribution theory for a class of likelihood criteria,
Biometrika, 36, 317-346. [8.5, 9.4, 10.4, 10.5]

Bravais, Auguste (1846), Analyse mathématique sur les probabilités des erreurs de
situation d’un point, Mémoires Présentés par Divers Savants à l'Académie Royale
des Sciences de l'Institut de France, 9, 255-332. [1.2]

Brown, G. W. (1939), On the power of the L₁ test for equality of several variances,
Annals of Mathematical Statistics, 10, 119-128. [10.2]

Browne, M. W. (1974), Generalized least squares estimates in the analysis of covariance
structures, South African Statistical Journal, 8, 1-24. [14.3]

Chambers, John M. (1977), Computational Methods for Data Analysis, John Wiley &
Sons, New York. [12.3]

Chan, Tony F., Gene H. Golub, and Randall J. LeVeque (1981), Algorithms for
computing the sample variance: analysis and recommendations, unpublished.
[3.2]

Chernoff, Herman (1956), Large sample theory, parametric case, Annals of Mathematical
Statistics, 27, 1-22. [8.6]

Chernoff, Herman, and N. Divinsky (1953), The computation of maximum likelihood
estimates of linear structural equations, Studies in Econometric Method (W. C.
Hood and T. C. Koopmans, eds.), John Wiley & Sons, Inc., New York. [12.8]

Chu, S. Sylvia, and K. C. S. Pillai (1979), Power comparisons of two-sided tests of
equality of two covariance matrices based on six criteria, Annals of the Institute of
Statistical Mathematics, 31, 185-205. [10.6]

Chung, Kai Lai (1974), A Course in Probability Theory, 2nd ed., Academic, New York.
[2.2]

Clemm, D. S., P. R. Krishnaiah, and V. B. Waikar (1973), Tables for the extreme
roots of the Wishart matrix, Journal of Statistical Computation and Simulation, 2,
65-92. [10.8]

Clunies-Ross, C. W., and R. H. Riffenburgh (1960), Geometry and linear discrimination,
Biometrika, 47, 185-189. [6.10]

Cochran, W. G. (1934), The distribution of quadratic forms in a normal system, with
applications to the analysis of covariance, Proceedings of the Cambridge Philosophical
Society, 30, 178-191. [7.4]





Constantine, A, G. (1963), Some non-cerural distribution problems in multivariate 
analysis. Annals of Mathematical Statistics t 34, 1270-1285, [12.4] 

Constnntine. A. G. (IThe distribution of Hotelling’s generalised T^ y Annals of 
Mathematical Statistics, 37, 215-225. [8.6] 

Consul. P. C. (1966). On the exact distributions of the likelihood ratio criteria for 
testing linear hypotheses about regression coefficients, Annals of Mathematical 
Statistics. 37, 1319^1330. [8,4] 

Consul. P, C (1967a), On the exact distributions of likelihcod ratio criteria for testing 
independence of sets of variates under the null hypothesis, Annals of Mathemati- 
cal StaristicSy 38, 1160-1169. [9.3] 

Consul, P. C (1967b〕，On the exact distributions of the criterion W for testing 
sphericity and in a /7-variate normal distribution, Annals of Mathematical Statis¬ 
tics, 38, 1170-1174. [10.7] 

Courant. R, and D. Hilbert (1953 )，Methods of Mathematicd Physics, Interscience, 
New York. [9.10] 

Cox. D. R, and N. Wermuth (1996 )，Multivariate Dependencies, Chapman and Hall, 
London. [15.1] 

Cramer. H. (1946). Maihematical Methods of Statistics, Princeton University, Prince¬ 
ton. [2.6. 3.2, 3.4. 6.5. 7.2] ° 

Dahin. P. Fred, and Wayne A. Fuller (1981)，Generalized least squares estimation of 
the functional multivariate linear errors in variables model, unpublished. [14.3] 

Duly, J. F. (1940), On the unbiased character of likelihood-ratio tests for indepen¬ 
dence in normal systems, Annals of Mathematical Siaiis-iics, 11, 卜 32. [9.2] 

Darlington, R. B., S. L. Weinberg, and H. J. Walberg (1973), Canonical variate
analysis and related techniques, Review of Educational Research, 43, 433-454.
[12.2]

Das Gupta, Somesh (1965), Optimum classification rules for classification into two
multivariate normal populations, Annals of Mathematical Statistics, 36, 1174-1184.
[6.6]

Das Gupta, S., T. W. Anderson, and G. S. Mudholkar (1964), Monotonicity of the
power functions of some tests of the multivariate linear hypothesis, Annals of
Mathematical Statistics, 35, 200-205. [8.10]

David, F. N. (1937), A note on unbiased limits for the correlation coefficient,
Biometrika, 29, 157-160. [4.2]

David, F. N. (1938), Tables of the Ordinates and Probability Integral of the Distribution of
the Correlation Coefficient in Small Samples, Cambridge University, Cambridge.
[4.2]

Davis, A. W. (1968), A system of linear differential equations for the distribution of
Hotelling’s generalized T_0^2, Annals of Mathematical Statistics, 39, 815-832. [8.6]

Davis, A. W. (1970a), Exact distributions of Hotelling’s generalized T_0^2, Biometrika,
57, 187-191. [Preface, 8.6]

Davis, A. W. (1970b), Further applications of a differential equation for Hotelling’s
generalized T_0^2, Annals of the Institute of Statistical Mathematics, 22, 77-87.
[Preface, 8.6]

Davis, A. W. (1971), Percentile approximations for a class of likelihood ratio criteria,
Biometrika, 58, 349-356. [10.8, 10.9]






Davis, A. W. (1972a), On the marginal distributions of the latent roots of the
multivariate beta matrix, Annals of Mathematical Statistics, 43, 1664-1670. [8.6]

Davis, A. W. (1972b), On the distributions of the latent roots and traces of certain
random matrices, Journal of Multivariate Analysis, 2, 189-200. [8.6]

Davis, A. W. (1980), Further tabulation of Hotelling’s generalized T_0^2,
Communications in Statistics, B9, 321-336. [Preface, 8.6]

Davis, A. W., and J. B. F. Field (1971), Tables of some multivariate test criteria,
Technical Report No. 32, Division of Mathematical Statistics, C.S.I.R.O.,
Canberra, Australia. [10.8]

Davis, Harold T. (1933), Tables of the Higher Mathematical Functions, Vol. I, Principia
Press, Bloomington, Indiana. [8.5]

Davis, Harold T. (1935), Tables of the Higher Mathematical Functions, Vol. II, Principia
Press, Bloomington, Indiana. [8.5]

Deemer, Walter L., and Ingram Olkin (1951), The Jacobians of certain matrix
transformations useful in multivariate analysis. Based on lectures of P. L. Hsu at
the University of North Carolina, 1947, Biometrika, 38, 345-367. [13.2]

De Groot, Morris H. (1970), Optimal Statistical Decisions, McGraw-Hill, New York.
[3.4, 6.2, 6.7]

Dempster, A. P. (1972), Covariance selection, Biometrics, 28, 157-175. [15.5]

Dempster, A. P., N. M. Laird, and D. B. Rubin (1977), Maximum likelihood from
incomplete data via the EM algorithm (with discussion), Journal of the Royal
Statistical Society B, 39, 1-38. [14.3]

Diaconis, Persi, and Bradley Efron (1983), Computer-intensive methods in statistics,
Scientific American, 248, 116-130. [4.3]

Eaton, M. L., and M. D. Perlman (1974), A monotonicity property of the power
functions of some invariant tests for MANOVA, Annals of Statistics, 2, 1022-1028.
[8.10]

Edwards, D. (1995), Introduction to Graphical Modelling, Springer-Verlag, New York.
[15.1]

Efron, Bradley (1982), The Jackknife, the Bootstrap, and Other Resampling Plans,
Society for Industrial and Applied Mathematics, Philadelphia. [4.2]

Efron, Bradley, and Carl Morris (1977), Stein’s paradox in statistics, Scientific
American, 236, 119-127. [3.5]

Elfving, G. (1947), A simple method of deducing certain distributions connected with
multivariate sampling, Skandinavisk Aktuarietidskrift, 30, 56-74. [7.2]

Fang, Kai-Tai, Samuel Kotz, and Kai-Wang Ng (1990), Symmetric Multivariate and
Related Distributions, Chapman and Hall, New York. [2.7]

Fang, Kai-Tai, and Yao-Ting Zhang (1990), Generalized Multivariate Analysis,
Springer-Verlag, New York. [2.7, 3.6, 10.11, 7.9]

Ferguson, Thomas Shelburne (1967), Mathematical Statistics: A Decision Theoretic
Approach, Academic, New York. [3.4, 6.2, 6.7]

Fisher, R. A. (1915), Frequency distribution of the values of the correlation
coefficient in samples from an indefinitely large population, Biometrika, 10, 507-521.
[4.2, 7.2]

Fisher, R. A. (1921), On the “probable error” of a coefficient of correlation deduced
from a small sample, Metron, 1, Part 4, 3-32. [4.2]





Fisher, R. A. (1924), The distribution of the partial correlation coefficient, Metron, 3, 
329-332. [4.3] 

Fisher, R. A. (1928), The general sampling distribution of the multiple correlation
coefficient, Proceedings of the Royal Society of London, A, 121, 654-673. [4.4]

Fisher, R. A. (1936), The use of multiple measurements in taxonomic problems,
Annals of Eugenics, 7, 179-188. [5.3, 6.5, 11.5]

Fisher, R. A. (1939), The sampling distribution of some statistics obtained from
non-linear equations, Annals of Eugenics, 9, 238-249. [13.2]

Fisher, R. A. (1947a), The Design of Experiments (4th ed.), Oliver and Boyd,
Edinburgh. [8.9]

Fisher, R. A. (1947b), The analysis of covariance method for the relation between a
part and the whole, Biometrics, 3, 65-68. [3.P]

Fisher, R. A., and F. Yates (1942), Statistical Tables for Biological, Agricultural and
Medical Research (2nd ed.), Oliver and Boyd, Edinburgh. [4.2]

Fog, David (1948), The geometrical method in the theory of sampling, Biometrika, 35, 
46-54. [7.2] 

Foster, F. G. (1957), Upper percentage points of the generalized beta distribution, II,
Biometrika, 44, 441-453. [8.6]

Foster, F. G. (1958), Upper percentage points of the generalized beta distribution, III,
Biometrika, 45, 492-503. [8.6]

Foster, F. G., and D. H. Rees (1957), Upper percentage points of the generalized beta
distribution, I, Biometrika, 44, 237-247. [8.6]

Frets, G. P. (1921), Heredity of head form in man, Genetica, 3, 193-384. [3.P] 

Frisch, R. (1929), Correlation and scatter in statistical variables, Nordic Statistical
Journal, 8, 36-102. [7.5]

Frydenberg, M. (1990), The chain graph Markov property, Scandinavian Journal of
Statistics, 17, 333-353. [15.2]

Fujikoshi, Y. (1973), Monotonicity of the power functions of some tests in general
MANOVA models, Annals of Statistics, 1, 388-391. [8.6]

Fujikoshi, Y. (1974), The likelihood ratio tests for the dimensionality of regression
coefficients, Journal of Multivariate Analysis, 4, 327-340. [12.4]

Fujikoshi, Y., and M. Kanazawa (1976), The ML classification statistic in covariate
discriminant analysis and its asymptotic expansions, Essays in Probability and
Statistics, 305-320. [6.6]

Fuller, Wayne A., Sastry G. Pantula, and Yasuo Amemiya (1982), The covariance
matrix of estimators for the factor model, unpublished. [14.4]

Gabriel, K. R. (1969), Simultaneous test procedures — some theory of multiple
comparisons, Annals of Mathematical Statistics, 40, 224-250. [8.7]

Gajjar, A. V. (1967), Limiting distributions of certain transformations of multiple 
correlation coefficients, Metron, 26, 189-193. [4.4] 

Galton, Francis (1889), Natural Inheritance, MacMillan, London. [1.2, 2.5] 

Gauss, K. F. (1823), Theory of the Combination of Observations, Göttingen. [1.2]

Giri, N. (1977), Multivariate Statistical Inference, Academic, New York. [7.2, 7.7]

Giri, N., and J. Kiefer (1964), Local and asymptotic minimax properties of
multivariate tests, Annals of Mathematical Statistics, 35, 21-35. [5.6]





Giri, N., J. Kiefer, and C. Stein (1963), Minimax character of Hotelling’s T^2 test in
the simplest case, Annals of Mathematical Statistics, 34, 1524-1535. [5.6]

Girshick, M. A. (1939), On the sampling theory of roots of determinantal equations,
Annals of Mathematical Statistics, 10, 203-224. [13.2]

Gleser, Leon Jay (1981), Estimation in a multivariate “errors in variables” regression
model: large sample results, Annals of Statistics, 9, 24-44. [12.1]

Glynn, W. J., and R. J. Muirhead (1978), Inference in canonical correlation analysis,
Journal of Multivariate Analysis, 8, 468-478. [12.4]

Golub, Gene H., and Franklin T. Luk (1976), Singular value decomposition:
applications and computations, unpublished. [12.3]

Golub, Gene H., and Charles F. Van Loan (1989), Matrix Computations (2nd ed.),
Johns Hopkins University Press, Baltimore. [11.4, 12.3, A.5]

Grubbs, F. E. (1954), Tables of 1% and 5% probability levels of Hotelling's
generalized T^2 statistics, Technical Note No. 926, Ballistic Research Laboratory,
Aberdeen Proving Ground, Maryland. [8.6]

Gupta, Shanti S. (1963), Bibliography on the multivariate normal integrals and related
topics, Annals of Mathematical Statistics, 34, 829-838. [2.3]

Gurland, John (1968), A relatively simple form of the distribution of the multiple
correlation coefficient, Journal of the Royal Statistical Society B, 30, 276-283. [4.4]

Gurland, J., and R. Milton (1970), Further consideration of the distribution of the
multiple correlation coefficient, Journal of the Royal Statistical Society B, 32,
381-394. [4.4]

Haavelmo, T. (1944), The probability approach in econometrics, Econometrica, 12,
Supplement, 1-118. [12.7]

Haff, L. R. (1980), Empirical Bayes estimation of the multivariate normal covariance
matrix, Annals of Statistics, 8, 586-597. [7.8]

Halmos, P. R. (1950), Measure Theory, D. van Nostrand, New York. [4.5, 13.3]

Harris, Bernard, and Andrew P. Soms (1980), The use of the tetrachoric series for
evaluating multivariate normal probabilities, Journal of Multivariate Analysis, 10,
252-267. [2.3]

Hayakawa, Takesi (1967), On the distribution of the maximum latent root of a
positive definite symmetric random matrix, Annals of the Institute of Statistical
Mathematics, 19, 1-17. [8.6]

Heck, D. L. (1960), Charts of some upper percentage points of the distribution of the
largest characteristic root, Annals of Mathematical Statistics, 31, 625-642. [8.6]

Hickman, W. Braddock (1953), The Volume of Corporate Bond Financing Since 1900,
Princeton University, Princeton, 82-90. [10.7]

Hoel, Paul G. (1937), A significance test for component analysis, Annals of
Mathematical Statistics, 8, 149-158. [7.5]

Hooker, R. H. (1907), The correlation of the weather and crops, Journal of the Royal
Statistical Society, 70, 1-42. [4.2]

Hotelling, Harold (1931), The generalization of Student’s ratio, Annals of
Mathematical Statistics, 2, 360-378. [5.1, 5.P]

Hotelling, Harold (1933), Analysis of a complex of statistical variables into principal
components, Journal of Educational Psychology, 24, 417-441, 498-520. [11.1, 14.3]





Hotelling, Harold (1935), The most predictable criterion, Journal of Educational
Psychology, 26, 139-142. [12.2]

Hotelling, Harold (1936), Relations between two sets of variates, Biometrika, 28,
321-377. [12.1]

Hotelling, Harold (1947), Multivariate quality control, illustrated by the air testing of
sample bombsights, Techniques of Statistical Analysis (C. Eisenhart, M. Hastay,
and W. A. Wallis, eds.), McGraw-Hill, New York, 111-184. [8.6]

Hotelling, Harold (1951), A generalized T test and measure of multivariate
dispersion, Proceedings of the Second Berkeley Symposium on Mathematical Statistics
and Probability (Jerzy Neyman, ed.), University of California, Los Angeles and
Berkeley, 23-41. [8.6, 10.7]

Hotelling, Harold (1953), New light on the correlation coefficient and its transforms
(with discussion), Journal of the Royal Statistical Society B, 15, 193-232. [4.2]

Howe, W. G. (1955), Some Contributions to Factor Analysis, U.S. Atomic Energy
Commission Report, Oak Ridge National Laboratory, Oak Ridge, Tennessee.
[14.2, 14.6]

Hsu, P. L. (1938), Notes on Hotelling’s generalized T, Annals of Mathematical
Statistics, 9, 231-243. [5.4]

Hsu, P. L. (1939a), On the distribution of the roots of certain determinantal
equations, Annals of Eugenics, 9, 250-258. [13.2]

Hsu, P. L. (1939b), A new proof of the joint product moment distribution, Proceedings
of the Cambridge Philosophical Society, 35, 336-338. [7.2]

Hsu, P. L. (1945), On the power functions for the E^2-test and the T^2-test, Annals of
Mathematical Statistics, 16, 278-286. [5.6]

Hudson, M. (1974), Empirical Bayes estimation, Technical Report No. 58, NSF
contract GP 30711X-2, Department of Statistics, Stanford University. [3.5]

Immer, F. R., H. D. Hayes, and LeRoy Powers (1934), Statistical determination of
barley varietal adaptation, Journal of the American Society of Agronomy, 26,
403-407. [8.9]

Ingham, A. E. (1933), An integral which occurs in statistics, Proceedings of the
Cambridge Philosophical Society, 29, 271-276. [7.2]

Ito, K. (1956), Asymptotic formulae for the distribution of Hotelling’s generalized T_0^2
statistic, Annals of Mathematical Statistics, 27, 1091-1105. [8.6]

Ito, K. (1960), Asymptotic formulae for the distribution of Hotelling's generalized T_0^2
statistic, II, Annals of Mathematical Statistics, 31, 1148-1153. [8.6]

Izenman, A. J. (1975), Reduced-rank regression for the multivariate linear model,
Journal of Multivariate Analysis, 5, 248-264. [12.7]

Izenman, Alan Julian (1980), Assessing dimensionality in multivariate regression,
Analysis of Variance, Handbook of Statistics, Vol. 1 (P. R. Krishnaiah, ed.),
North-Holland, Amsterdam, 571-591. [3.P]

James, A. T. (1954), Normal multivariate analysis and the orthogonal group, Annals of
Mathematical Statistics, 25, 40-75. [12]

James, A. T. (1964), Distributions of matrix variates and latent roots derived from
normal samples, Annals of Mathematical Statistics, 35, 475-501. [8.6]





James, W., and C. Stein (1961), Estimation with quadratic loss, Proceedings of the
Fourth Berkeley Symposium on Mathematical Statistics and Probability (Jerzy
Neyman, ed.), Vol. I, 361-379, University of California, Berkeley. [3.5, 7.8]

Japanese Standards Association (1972), Statistical Tables and Formulas with Computer
Applications. [Preface]

Jennrich, Robert I., and Dorothy T. Thayer (1973), A note on Lawley's formulas for
standard errors in maximum likelihood factor analysis, Psychometrika, 38,
571-580. [14.3]

Johansen, S. (1995), Likelihood-based Inference in Cointegrated Vector Autoregression
Models, Oxford University Press. [12.7]

Jolicoeur, Pierre, and J. E. Mosimann (1960), Size and shape variation in the painted
turtle, a principal component analysis, Growth, 24, 339-354. Also in Benchmark
Papers in Systematic and Evolutionary Biology (E. H. Bryant and W. R. Atchley,
eds.), 2 (1975), 86-101. [11.P]

Jöreskog, K. G. (1969), A general approach to confirmatory maximum likelihood
factor analysis, Psychometrika, 34, 183-201. [14.2]

Jöreskog, K. G., and Arthur S. Goldberger (1972), Factor analysis by generalized least
squares, Psychometrika, 37, 243-260. [14.3]

Kaiser, Henry F. (1958), The varimax criterion for analytic rotation in factor analysis,
Psychometrika, 23, 187-200. [14.5]

Kanazawa, M. (1979), The asymptotic cut-off point and comparison of error
probabilities in covariate discriminant analysis, Journal of the Japan Statistical Society, 9,
7-17. [6.6]

Kelley, T. L. (1928), Crossroads in the Mind of Man, Stanford University, Stanford.
[4.P, 9.P]

Kendall, M. G., and Alan Stuart (1973), The Advanced Theory of Statistics (3rd ed.),
Vol. 2, Charles Griffin, London. [12.6]

Kennedy, William J., Jr., and James E. Gentle (1980), Statistical Computing, Marcel
Dekker, New York. [12.3]

Khatri, C. G. (1963), Joint estimation of the parameters of multivariate normal
populations, Journal of Indian Statistical Association, 1, 125-133. [7.2]

Khatri, C. G. (1966), A note on a large sample distribution of a transformed multiple
correlation coefficient, Annals of the Institute of Statistical Mathematics, 18,
375-380. [4.4]

Khatri, C. G. (1972), On the exact finite series distribution of the smallest or the
largest root of matrices in three situations, Journal of Multivariate Analysis, 2,
201-207. [8.6]

Khatri, C. G., and K. C. Sreedharan Pillai (1966), On the moments of the trace of a
matrix and approximations to its non-central distribution, Annals of Mathematical
Statistics, 37, 1312-1318. [8.6]

Khatri, C. G., and K. C. S. Pillai (1968), On the non-central distributions of two test
criteria in multivariate analysis of variance, Annals of Mathematical Statistics, 39,
215-226. [8.6]

Khatri, C. G., and K. V. Ramachandran (1958), Certain multivariate distribution
problems, I (Wishart’s distribution), Journal of the Maharaja Sayajirao University
of Baroda, 7, 79-82. [7.2]





Kiefer, J. (1957), Invariance, minimax sequential estimation, and continuous time 
processes, Annals of Mathematical Statistics, 28, 573-601. [7.8] 

Kiefer, J. (1966), Multivariate optimality results, Multivariate Analysis (Paruchuri R.
Krishnaiah, ed.), Academic, New York, 255-274. [7.8]

Kiefer, J., and R. Schwartz (1965), Admissible Bayes character of T^2-, R^2-, and other
fully invariant tests for classical multivariate normal problems, Annals of
Mathematical Statistics, 36, 747-770. [5.P, 9.9, 10.10]

Klotz, Jerome, and Joseph Putter (1969), Maximum likelihood estimation of the
multivariate covariance components for the balanced one-way layout, Annals of
Mathematical Statistics, 40, 1100-1105. [10.6]

Kolmogorov, A. (1950), Foundations of the Theory of Probability, Chelsea, New York.
[2.2]

Konishi, Sadanori (1978a), An approximation to the distribution of the sample
correlation coefficient, Biometrika, 65, 654-656. [4.2]

Konishi, Sadanori (1978b), Asymptotic expansions for the distributions of statistics
based on a correlation matrix, Canadian Journal of Statistics, 6, 49-56. [4.2]

Konishi, Sadanori (1979), Asymptotic expansions for the distributions of functions of a
correlation matrix, Journal of Multivariate Analysis, 9, 259-266. [4.2]

Koopmans, T. C., and Olav Reiersøl (1950), The identification of structural
characteristics, Annals of Mathematical Statistics, 21, 165-181. [14.2]

Korin, B. P. (1968), On the distribution of a statistic used for testing a covariance
matrix, Biometrika, 55, 171-178. [10.8]

Korin, B. P. (1969), On testing the equality of k covariance matrices, Biometrika, 56,
216-218. [10.5]

Kramer, K. H. (1963), Tables for constructing confidence limits on the multiple
correlation coefficient, Journal of the American Statistical Association, 58,
1082-1085. [4.4]

Krishnaiah, P. R. (1978), Some recent developments on real multivariate distributions,
Development in Statistics (P. R. Krishnaiah, ed.), Vol. 1, Academic, New York,
135-169. [8.6]

Krishnaiah, P. R. (1980), Computations of some multivariate distributions, Analysis of
Variance, Handbook of Statistics, Vol. 1 (P. R. Krishnaiah, ed.), North-Holland,
Amsterdam, 745-971.

Krishnaiah, P. R., and T. C. Chang (1972), On the exact distributions of the traces of
S_1(S_1 + S_2)^{-1} and S_1 S_2^{-1}, Sankhya, A, 34, 153-160. [8.6]

Krishnaiah, P. R., and F. J. Schuurmann (1974), On the evaluation of some
distributions that arise in simultaneous tests for the equality of the latent roots of the
covariance matrix, Journal of Multivariate Analysis, 4, 265-282. [10.7]

Kshirsagar, A. M. (1959), Bartlett decomposition and Wishart distribution, Annals of
Mathematical Statistics, 30, 239-241. [7.2]

Kudo, H. (1955), On minimax invariant estimates of the transformation parameter,
Natural Science Report, 6, 31-73, Ochanomizu University, Tokyo, Japan. [7.8]

Kunitomo, Naoto (1980), Asymptotic expansions of the distributions of estimators in a
linear functional relationship and simultaneous equations, Journal of the
American Statistical Association, 75, 693-700. [12.7]





Lachenbruch, P. A., and M. R. Mickey (1968), Estimation of error rates in
discriminant analysis, Technometrics, 10, 1-11. [6.6]

Laplace, P. S. (1811), Mémoire sur les intégrales définies et leur application aux
probabilités, Mémoires de l'Institut Impérial de France, Année 1810, 279-347. [1.2]

Lauritzen, Steffen L. (1996), Graphical Models, Clarendon Press, Oxford. [15.5]

Lauritzen, Steffen L., and N. Wermuth (1989), Graphical models for associations
between variables some of which are qualitative and some quantitative, Annals of
Statistics, 17, 31-57. [15.2]

Läuter, Jürgen, Ekkehard Glimm, and Siegfried Kropf (1996a), New multivariate tests
for data with an inherent structure, Biometrical Journal, 38, 5-23. [Correction: 40
(1998), 1015.] [5.7]

Läuter, Jürgen, Ekkehard Glimm, and Siegfried Kropf (1996b), Multivariate tests
based on left-spherically distributed linear scores, Annals of Statistics, 26,
1972-1988. [5.7]

Läuter, Jürgen, Ekkehard Glimm, and Siegfried Kropf (1996c), Exact stable
multivariate tests for applications in clinical research, ASA Proceedings of the
Biopharmaceutical Section, 46-55. [5.7]

Lawley, D. N. (1938), A generalization of Fisher's z test, Biometrika, 30, 180-187.
[8.6]

Lawley, D. N. (1940), The estimation of factor loadings by the method of maximum
likelihood, Proceedings of the Royal Society of Edinburgh, Sec. A, 60, 64-82. [14.3]

Lawley, D. N. (1941), Further investigations in factor estimation, Proceedings of the
Royal Society of Edinburgh, Sec. A, 61, 176-185. [14.4]

Lawley, D. N. (1953), A modified method of estimation in factor analysis and some
large sample results, Uppsala Symposium on Psychological Factor Analysis, 17-19
March 1953, Uppsala, Almqvist and Wiksell, 35-42. [14.3]

Lawley, D. N. (1958), Estimation in factor analysis under various initial assumptions,
British Journal of Statistical Psychology, 11, 1-12. [14.2, 14.6]

Lawley, D. N. (1959), Tests of significance in canonical analysis, Biometrika, 46,
59-66. [114]

Lawley, D. N. (1967), Some new results in maximum likelihood factor analysis,
Proceedings of the Royal Society of Edinburgh, Sec. A, 67, 256-264. [14.3]

Lawley, D. N., and A. E. Maxwell (1971), Factor Analysis as a Statistical Method (2nd
ed.), American Elsevier, New York. [14.3, 14.5]

Lee, Y. S. (1971a), Asymptotic formulae for the distribution of a multivariate test
statistic: power comparisons of certain multivariate tests, Biometrika, 58, 647-651.
[8.6]

Lee, Y. S. (1971b), Some results on the sampling distribution of the multiple
correlation coefficient, Journal of the Royal Statistical Society, B, 33, 117-130. [4.4]

Lee, Y. S. (1972), Tables of upper percentage points of the multiple correlation
coefficient, Biometrika, 59, 175-189. [4.4]

Lehmann, E. L. (1959), Testing Statistical Hypotheses, John Wiley & Sons, New York.
[4.2, 5.6]

Lehmer, Emma (1944), Inverse tables of probabilities of errors of the second kind,
Annals of Mathematical Statistics, 15, 388-398. [5.4]





Loève, M. (1977), Probability Theory I (4th ed.), Springer-Verlag, New York. [2.2]

Loève, M. (1978), Probability Theory II (4th ed.), Springer-Verlag, New York. [2.2]

Madow, W. G. (1938), Contributions to the theory of multivariate statistical analysis,
Transactions of the American Mathematical Society, 44, 454-495. [7.2]

Magnus, Jan R. (1988), Linear Structures, Charles Griffin and Co., London. [A.4] 

Magnus, J. R., and H. Neudecker (1979), The commutation matrix: some properties
and applications, The Annals of Statistics, 7, 381-394. [3.6]

Mahalanobis, P. C. (1930), On tests and measures of group divergence, Journal and
Proceedings of the Asiatic Society of Bengal, 26, 541-588. [3.3]

Mahalanobis, P. C., R. C. Bose, and S. N. Roy (1937), Normalisation of statistical
variates and the use of rectangular co-ordinates in the theory of sampling
distributions, Sankhya, 3, 1-40. [7.2]

Mallows, C. L. (1961), Latent vectors of random symmetric matrices, Biometrika, 48,
133-149. [11.6]

Mardia, K. V. (1970), Measures of multivariate skewness and kurtosis with
applications, Biometrika, 57, 519-530. [3.6]

Mariano, Roberto S., and Takamitsu Sawa (1972), The exact finite-sample
distribution of the limited-information maximum likelihood estimator in the case of two
included endogenous variables, Journal of the American Statistical Association, 67,
159-163. [12.7]

Maronna, R. A. (1976), Robust M-estimators of multivariate location and scatter,
Annals of Statistics, 4, 51-67. [3.6]

Marshall, A. W., and I. Olkin (1979), Inequalities: Theory of Majorization and Its
Applications, Academic, New York. [8.10]

Mathai, A. M. (1971), On the distribution of the likelihood ratio criterion for testing
linear hypotheses on regression coefficients, Annals of the Institute of Statistical
Mathematics, 23, 181-197. [8.4]

Mathai, A. M., and R. S. Katiyar (1979), Exact percentage points for testing
independence, Biometrika, 66, 353-356. [9.3]

Mathai, A. M., and P. N. Rathie (1980), The exact non-null distribution for testing
equality of covariance matrices, Sankhya A, 42, 78-87. [10.4]

Mathai, A. M., and R. K. Saxena (1973), Generalized Hypergeometric Functions with
Applications in Statistics and Physical Sciences, Lecture Notes No. 348,
Springer-Verlag, New York. [9.3]

Mauchly, J. W. (1940), Significance test for sphericity of a normal n-variate
distribution, Annals of Mathematical Statistics, 11, 204-209. [10.7]

Mauldon, J. G. (1955), Pivotal quantities for Wishart's and related distributions, and a
paradox in fiducial theory, Journal of the Royal Statistical Society B, 17, 79-85.
[7.2]

McDonald, Roderick P. (2002), What can we learn from path equations:
identifiability, constraints, equivalence, Psychometrika, 67, 225-249. [15.1]

McLachlan, G. J. (1973), An asymptotic expansion of the expectation of the estimated
error rate in discriminant analysis, Australian Journal of Statistics, 15, 210-214.
[6.6]





McLachlan, G. J. (1974a), An asymptotic unbiased technique for estimating the error
rates in discriminant analysis, Biometrics, 30, 239-249. [6.6]

McLachlan, G. J. (1974b), Estimation of the errors of misclassification on the
criterion of asymptotic mean square error, Technometrics, 16, 255-260. [6.6]

McLachlan, G. J. (1974c), The asymptotic distributions of the conditional error rate
and risk in discriminant analysis, Biometrika, 61, 131-135. [6.6]

McLachlan, G. J. (1977), Constrained sample discrimination with the studentized
classification statistic, Communications in Statistics — Theory and Methods, A6,
575-583. [6.6]

Memon, A. Z., and M. Okamoto (1971), Asymptotic expansion of the distribution of
the Z statistic in discriminant analysis, Journal of Multivariate Analysis, 1,
294-307. [6.6]

Mijares, T. A. (1964), Percentage Points of the Sum of s Roots (s = 1-50), The
Statistical Center, University of the Philippines, Manila. [8.6]

Mikhail, N. N. (1965), A comparison of tests of the Wilks-Lawley hypothesis in
multivariate analysis, Biometrika, 52, 149-156. [8.6]

Mood, A. M. (1951), On the distribution of the characteristic roots of normal
second-moment matrices, Annals of Mathematical Statistics, 22, 266-273. [13.2]

Morris, Blair, and Ingram Olkin (1964), Some estimation and testing problems for
factor analysis models, unpublished. [10.6]

Mudholkar, G. S. (1966), On confidence bounds associated with multivariate analysis
of variance and non-independence between two sets of variates, Annals of
Mathematical Statistics, 37, 1736-1746. [8.7]

Mudholkar, Govind S., and Madhusudan C. Trivedi (1980), A normal approximation
for the distribution of the likelihood ratio statistic in multivariate analysis of
variance, Biometrika, 67, 485-488. [8.5]

Mudholkar, Govind S., and Madhusudan C. Trivedi (1981), A normal approximation
for the multivariate likelihood ratio statistics, Statistical Distributions in Scientific
Work (C. Taillie et al., eds.), Vol. 5, 219-230, D. Reidel Publishing. [8.5]

Muirhead, R. J. (1970), Asymptotic distributions of some multivariate tests, Annals of
Mathematical Statistics, 41, 1002-1010. [8.6]

Muirhead, R. J. (1980), The effects of elliptical distributions on some standard
procedures involving correlation coefficients: a review, Multivariate Statistical
Analysis (R. P. Gupta, ed.), 143-159. [4.5]

Muirhead, Robb J. (1982), Aspects of Multivariate Statistical Theory, John Wiley and
Sons, New York. [2.7, 3.6, 7.7]

Muirhead, R. J., and C. M. Waternaux (1980), Asymptotic distributions in canonical
correlation analysis and other multivariate procedures for nonnormal
populations, Biometrika, 67, 31-43. [4.5]

Nagao, Hisao (1973a), On some test criteria for covariance matrix, Annals of Statistics,
1, 700-709. [9.5, 10.2, 10.7, 10.8]

Nagao, Hisao (1973b), Asymptotic expansions of the distributions of Bartlett's test and
sphericity test under the local alternatives, Annals of the Institute of Statistical
Mathematics, 25, 407-422. [10.5, 10.6]

Nagao, Hisao (1973c), Nonnull distributions of two test criteria for independence
under local alternatives, Journal of Multivariate Analysis, 3, 435-444. [9.4]





Nagarsenker, B. N., and K. C. S. Pillai (1972), The Distribution of the Sphericity Test
Criterion, ARL 72-0154, Aerospace Research Laboratories. [Preface]

Nagarsenker, B. N., and K. C. S. Pillai (1973a), The distribution of the sphericity test
criterion, Journal of Multivariate Analysis, 3, 226-235. [10.7]

Nagarsenker, B. N., and K. C. S. Pillai (1973b), Distribution of the likelihood ratio
criterion for testing a hypothesis specifying a covariance matrix, Biometrika, 60,
359-364. [10.8]

Nagarsenker, B. N., and K. C. S. Pillai (1974), Distribution of the likelihood ratio
criterion for testing Σ = Σ_0, μ = μ_0, Journal of Multivariate Analysis, 4, 114-122.
[10.9]

Nanda, D. N. (1948), Distribution of a root of a determinantal equation, Annals of
Mathematical Statistics, 19, 47-57. [8.6]

Nanda, D. N. (1950), Distribution of the sum of roots of a determinantal equation
under a certain condition, Annals of Mathematical Statistics, 21, 432-439. [8.6]

Nanda, D. N. (1951), Probability distribution tables of the larger root of a
determinantal equation with two roots, Journal of the Indian Society of Agricultural Statistics,
3, 175-177. [8.6]

Narain, R. D. (1948), A new approach to sampling distributions of the multivariate
normal theory, I, Journal of the Indian Society of Agricultural Statistics, 1, 59-69.
[7.2]

Narain, R. D. (1950), On the completely unbiased character of tests of independence
in multivariate normal systems, Annals of Mathematical Statistics, 21, 293-298.
[9.2]

National Bureau of Standards, United States (1959), Tables of the Bivariate Normal
Distribution Function and Related Functions, U.S. Government Printing Office,
Washington, D.C. [2.3]

Neveu, Jacques (1965), Mathematical Foundations of the Calculus of Probability,
Holden-Day, San Francisco. [2.2]

Ogawa, J. (1953), On the sampling distributions of classical statistics in multivariate
analysis, Osaka Mathematics Journal, 5, 13-52. [7.2]

Okamoto, Masashi (1963), An asymptotic expansion for the distribution of the linear
discriminant function, Annals of Mathematical Statistics, 34, 1286-1301.
(Correction, 39 (1968), 1358-1359.) [6.6]

Okamoto, Masashi (1973), Distinctness of the eigenvalues of a quadratic form in a
multivariate sample, Annals of Statistics, 1, 763-765. [13.2]

Olkin, Ingram, and S. N. Roy (1954), On multivariate distribution theory, Annals of
Mathematical Statistics, 25, 329-339. [7.2]

Olson, C. L. (1974), Comparative robustness of six tests in multivariate analysis of
variance, Journal of the American Statistical Association, 69, 894-908. [8.6]

Pearl, Judea (2000), Causality: Models, Reasoning, and Inference, Cambridge
University Press, Cambridge. [15.1]

Pearson, E, S„ and H, O, Hartley (1972 )， Biometrika Tables for Statisticians, Vol, II ， 
Cambridge (England), Published for the Biometrika Trustees at the University 
Press, [Preface, 8.4] 

Pearson ， E. S,，and S. S, Wilks (1933), Methods of statistical analysis appropriate for k 
samples of two variables, Biometrika, 25, 353-378. [10.5] 



Pearson, K. (1896), Mathematical contributions to the theory of evolution. III.
Regression, heredity and panmixia, Philosophical Transactions of the Royal Society
of London, Series A, 187, 253-318. [2.5, 3.2]

Pearson, K. (1900), On the criterion that a given system of deviations from the
probable in the case of a correlated system of variables is such that it can be
reasonably supposed to have arisen from random sampling, Philosophical Magazine,
50 (fifth series), 157-175. [3.3]

Pearson, K. (1901), On lines and planes of closest fit to systems of points in space, 
Philosophical Magazine, 2 (sixth series), 559-572. [11.2] 

Pearson, K. (1930), Tables for Statisticians and Biometricians, Part I (3rd ed.),
Cambridge University, Cambridge. [2.3]

Pearson, K. (1931), Tables for Statisticians and Biometricians, Part II, Cambridge
University, Cambridge. [2.3, 6.8]

Perlman, M. D. (1980), Unbiasedness of the likelihood ratio tests for equality of
several covariance matrices and equality of several multivariate normal populations,
Annals of Statistics, 8, 247-263. [10.2, 10.3]

Perlman, M. D., and I. Olkin (1980), Unbiasedness of invariant tests for MANOVA
and other multivariate problems, Annals of Statistics, 8, 1326-1341. [8.10]

Phillips, P. C. B. (1980), The exact distribution of instrumental variable estimators in
an equation containing n + 1 endogenous variables, Econometrica, 48, 861-878.
[12.7]

Phillips, P. C. B. (1982), A new approach to small sample theory, unpublished, Cowles
Foundation for Research in Economics, Yale University. [12.7]

Pillai, K. C. S. (1954), On some distribution problems in multivariate analysis, Mimeo
Series No. 54, Institute of Statistics, University of North Carolina, Chapel Hill,
North Carolina. [8.6]

Pillai, K. C. S. (1955), Some new test criteria in multivariate analysis, Annals of
Mathematical Statistics, 26, 117-121. [8.6]

Pillai, K. C. S. (1956), On the distribution of the largest or the smallest root of a
matrix in multivariate analysis, Biometrika, 43, 122-127. [8.6]

Pillai, K. C. S. (1960), Statistical Tables for Tests of Multivariate Hypotheses, Statistical
Center, University of the Philippines, Manila. [8.6]

Pillai, K. C. S. (1964), On the moments of elementary symmetric functions of the roots
of two matrices, Annals of Mathematical Statistics, 35, 1704-1712. [8.6]

Pillai, K. C. S. (1965), On the distribution of the largest characteristic root of a matrix
in multivariate analysis, Biometrika, 52, 405-412. [8.6]

Pillai, K. C. S. (1967), Upper percentage points of the largest root of a matrix in
multivariate analysis, Biometrika, 54, 189-194. [8.6]

Pillai, K. C. S., and A. K. Gupta (1969), On the exact distribution of Wilks’ criterion,
Biometrika, 56, 109-118. [8.4]

Pillai, K. C. S., and Y. S. Hsu (1979), Exact robustness studies of the test of
independence based on four multivariate criteria and their distribution problems
under violations, Annals of the Institute of Statistical Mathematics, 31, Part A,
85-101. [8.6]

Pillai, K. C. S., and K. Jayachandran (1967), Power comparisons of tests of two
multivariate hypotheses based on four criteria, Biometrika, 54, 195-210. [8.6]



Pillai, K. C. S., and K. Jayachandran (1970), On the exact distribution of Pillai’s
criterion, Journal of the American Statistical Association, 65, 447-454. [8.6]

Pillai, K. C. S., and T. A. Mijares (1959), On the moments of the trace of a matrix and
approximations to its distribution, Annals of Mathematical Statistics, 30,
1135-1140. [8.6]

Pillai, K. C. S., and B. N. Nagarsenker (1971), On the distribution of the sphericity
test criterion in classical and complex normal populations having unknown
covariance matrices, Annals of Mathematical Statistics, 42, 764-767. [10.7]

Pillai, K. C. S., and P. Samson, Jr. (1959), On Hotelling’s generalization of T^2,
Biometrika, 46, 160-168. [8.6]

Pillai, K. C. S., and T. Sugiyama (1969), Non-central distributions of the largest latent
roots of three matrices in multivariate analysis, Annals of the Institute of Statistical
Mathematics, 21, 321-327. [8.6]

Pillai, K. C. S., and D. L. Young (1971), On the exact distribution of Hotelling’s
generalized T_0^2, Journal of Multivariate Analysis, 1, 90-107. [8.6]

Plana, G. A. A. (1813), Mémoire sur divers problèmes de probabilité, Mémoires de
l'Académie Impériale de Turin, pour les Années 1811-1812, 20, 355-408. [1.2]

Polya, G. (1949), Remarks on computing the probability integral in one and two
dimensions, Proceedings of the Berkeley Symposium on Mathematical Statistics and
Probability (J. Neyman, ed.), 63-78. [2.P]

Rao, C. R. (1948a), The utilization of multiple measurements in problems of biological
classification, Journal of the Royal Statistical Society B, 10, 159-193. [6.9]

Rao, C. R. (1948b), Tests of significance in multivariate analysis, Biometrika, 35,
58-79. [5.3]

Rao, C. Radhakrishna (1951), An asymptotic expansion of the distribution of Wilks’s
criterion, Bulletin of the International Statistical Institute, 33, Part 2, 177-180. [8.5]

Rao, C. R. (1952), Advanced Statistical Methods in Biometric Research, John Wiley &
Sons, New York. [12.5]

Rao, C. R. (1973), Linear Statistical Inference and Its Applications (2nd ed.), John
Wiley & Sons, New York. [4.2]

Rasch, G. (1948), A functional equation for Wishart's distribution, Annals of Mathematical
Statistics, 19, 262-266. [7.2]

Reiersøl, Olav (1950), On the identifiability of parameters in Thurstone’s multiple
factor analysis, Psychometrika, 15, 121-149. [14.2]

Reinsel, G. C., and R. P. Velu (1998), Multivariate Reduced-rank Regression, Springer,
New York. [12.7]

Richardson, D. H. (1968), The exact distribution of a structural coefficient estimator,
Journal of the American Statistical Association, 63, 1214-1226. [12.7]

Rothenberg, Thomas J. (1977), Edgeworth expansions for multivariate test statistics,
IP-255, Center for Research in Management Science, University of California,
Berkeley. [8.6]

Roy, S. N. (1939), p-statistics or some generalisations in analysis of variance appropriate
to multivariate problems, Sankhya, 4, 381-396. [13.2]

Roy, S. N. (1945), The individual sampling distribution of the maximum, the minimum
and any intermediate of the p-statistics on the null-hypothesis, Sankhya, 7,
133-158. [8.6]



Roy, S. N. (1953), On a heuristic method of test construction and its use in
multivariate analysis, Annals of Mathematical Statistics, 24, 220-238. [8.6, 10.6]

Roy, S. N. (1957), Some Aspects of Multivariate Analysis, John Wiley & Sons, New
York. [8.6, 10.6, 10.8]

Ruben, Harold (1966), Some new results on the distribution of the sample correlation
coefficient, Journal of the Royal Statistical Society B, 28, 513-525. [4.2]

Rubin, Donald B., and Dorothy T. Thayer (1982), EM algorithms for ML factor
analysis, Psychometrika, 47, 69-76. [14.3]

Ryan, D. A. J., J. J. Hubert, E. M. Carter, J. B. Sprague, and J. Parrot (1992), A
reduced-rank multivariate regression approach to joint toxicity experiments,
Biometrics, 48, 155-162. [12.7]

Salaevskii, O. V. (1968), The minimax character of Hotelling's T^2 test (Russian),
Doklady Akademii Nauk SSSR, 180, 1048-1050. [5.6]

Salaevskii (Shalaevskii), O. V. (1971), Minimax character of Hotelling's T^2 test. I,
Investigations in Classical Problems of Probability Theory and Mathematical Statistics,
V. M. Kalinin and O. V. Shalaevskii (Seminar in Mathematics, V. I. Steklov
Institute, Leningrad, Vol. 13), Consultants Bureau, New York. [5.6]

Sargan, J. D., and W. M. Mikhail (1971), A general approximation to the distribution
of instrumental variables estimates, Econometrica, 39, 131-169. [12.7]

Sawa, Takamitsu (1969), The exact sampling distribution of ordinary least squares and
two-stage least squares estimators, Journal of the American Statistical Association,
64, 923-937. [12.7]

Schatzoff, M. (1966a), Exact distributions of Wilks’s likelihood ratio criterion,
Biometrika, 53, 347-358. [8.4]

Schatzoff, M. (1966b), Sensitivity comparisons among tests of the general linear
hypothesis, Journal of the American Statistical Association, 61, 415-435. [8.6]

Scheffé, Henry (1942), On the ratio of the variances of two normal populations,
Annals of Mathematical Statistics, 13, 371-388. [10.2]

Scheffé, Henry (1943), On solutions of the Behrens-Fisher problem, based on the
t-distribution, Annals of Mathematical Statistics, 14, 35-44. [5.5]

Schmidli, H. (1996), Reduced-rank Regression, Physica, Berlin. [12.7]

Schuurmann, F. J., P. R. Krishnaiah, and A. K. Chattopadhyay (1975), Exact percentage
points of the distribution of the trace of a multivariate beta matrix, Journal of
Statistical Computation and Simulation, 3, 331-343. [8.6]

Schuurmann, F. J., and V. B. Waikar (1973), Tables for the power function of Roy’s
two-sided test for testing hypothesis Σ = I in the bivariate case, Communications
in Statistics, 1, 271-280. [10.8]

Schuurmann, F. J., V. B. Waikar, and P. R. Krishnaiah (1975), Percentage points of
the joint distribution of the extreme roots of the random matrix S_1(S_1 + S_2)^{-1},
Journal of Statistical Computation and Simulation, 2, 17-38. [10.6]

Schwartz, R. (1967), Admissible tests in multivariate analysis of variance, Annals of
Mathematical Statistics, 38, 698-710. [8.10]

Serfling, Robert J. (1980), Approximation Theorems of Mathematical Statistics, John
Wiley & Sons, New York. [4.2]

Simaika, J. B. (1941), On an optimum property of two important statistical tests,
Biometrika, 32, 70-80. [5.6]



Siotani, Minoru (1980), Asymptotic approximations to the conditional distributions of
the classification statistic Z and its studentized form Z*, Tamkang Journal of
Mathematics, 11, 19-32. [6.6]

Siotani, M., and R. H. Wang (1975), Further expansion formulae for error rates and
comparison of the W- and the Z-procedures in discriminant analysis, Technical
Report No. 33, Department of Statistics, Kansas State University, Manhattan,
Kansas. [6.6]

Siotani, M., and R. H. Wang (1977), Asymptotic Expansions for Error Rates and
Comparison of the W-Procedure and the Z-Procedure in Discriminant Analysis,
Multivariate Analysis IV, North-Holland, Amsterdam, 523-545. [6.6]

Sitgreaves, Rosedith (1952), On the distribution of two random matrices used in
classification procedures, Annals of Mathematical Statistics, 23, 263-270. [6.5]

Solari, M. E. (1969), The "maximum likelihood solution" of the problem of estimating
a linear functional relationship, Journal of the Royal Statistical Society B, 31,
372-375. [14.4]

Spearman, Charles (1904), "General-intelligence," objectively determined and measured,
American Journal of Psychology, 15, 201-293. [14.2]

Speed, T. P., and H. Kiiveri (1986), Gaussian Markov distributions over finite graphs,
Annals of Statistics, 14, 138-150. [15.5]

Srivastava, M. S., and C. G. Khatri (1979), An Introduction to Multivariate Statistics,
North-Holland, New York. [10.9, 13.P]

Stein, C. (1956a), The admissibility of Hotelling’s T^2-test, Annals of Mathematical
Statistics, 27, 616-623. [5.6]

Stein, C. (1956b), Inadmissibility of the usual estimator for the mean of a multivariate
normal distribution, Proceedings of the Third Berkeley Symposium on Mathematical
and Statistical Probability (Jerzy Neyman, ed.), Vol. I, 197-206, University of
California, Berkeley. [3.5]

Stein, C. (1974), Estimation of the parameters of a multivariate normal distribution, I,
Estimation of the means, Technical Report No. 63, NSF Contract GP 30711X-2,
Department of Statistics, Stanford University. [3.5]

Stoica, P., and M. Viberg (1996), Maximum likelihood parameter and rank estimation
in reduced-rank multivariate linear regressions, IEEE Transactions on Signal
Processing, 44, 3069-3078. [12.7]

Student (W. S. Gosset) (1908), The probable error of a mean, Biometrika, 6, 1-25.
[3.2]

Styan, George P. H. (1990), The Collected Papers of T. W. Anderson: 1943-1985, John
Wiley & Sons, Inc., New York. [Preface]

Subrahmaniam, Kocherlota, and Kathleen Subrahmaniam (1973), Multivariate Analysis:
A Selected and Abstracted Bibliography, 1957-1972, Marcel Dekker, New York.
[Preface]

Sugiura, Nariaki, and Hisao Nagao (1968), Unbiasedness of some test criteria for the
equality of one or two covariance matrices, Annals of Mathematical Statistics, 39,
1686-1692. [10.8]

Sugiyama, T. (1967), Distribution of the largest latent root and the smallest latent
root of the generalized B statistic and F statistic in multivariate analysis, Annals
of Mathematical Statistics, 38, 1152-1159. [8.6]



Sugiyama, T., and K. Fukutomi (1966), On the distribution of the extreme characteristic
roots of the matrices in multivariate analysis, Reports of Statistical Application
Research, Union of Japanese Scientists and Engineers, 13. [8.6]

Sverdrup, Erling (1947), Derivation of the Wishart distribution of the second order
sample moments by straightforward integration of a multiple integral,
Skandinavisk Aktuarietidskrift, 30, 151-166. [7.2]

Tang, P. C. (1938), The power function of the analysis of variance tests with tables
and illustrations of their use, Statistical Research Memoirs, 2, 126-157. [5.4]

Theil, H. (assisted by J. S. Cramer, H. Moerman, and A. Russchen) (1961), Economic
Forecasts and Policy (2nd rev. ed.), North-Holland, Amsterdam, Contributions to
Economic Analysis No. XV (first published 1958). [12.8]

Thomson, Godfrey H. (1934), Hotelling's method modified to give Spearman's "g,"
Journal of Educational Psychology, 25, 366-374. [14.3]

Thomson, Godfrey H. (1951), The Factorial Analysis of Human Ability (5th ed.),
University of London, London. [14.7]

Thurstone, L. L. (1947), Multiple-Factor Analysis, University of Chicago, Chicago.
[14.2, 14.5]

Tsay, R. S., and G. C. Tiao (1985), Use of canonical analysis in time series model
identification, Biometrika, 72, 299-315. [12.7]

Tukey, J. W. (1949), Dyadic anova, an analysis of variance for vectors, Human Biology,
21, 65-110. [8.9]

Tyler, David E. (1981), Asymptotic inference for eigenvectors, Annals of Statistics, 9,
725-745. [11.7]

Tyler, David E. (1982), Radial estimates and the test for sphericity, Biometrika, 69,
429-436. [3.6]

Tyler, David E. (1983a), Robustness and efficiency properties of scatter matrices,
Biometrika, 70, 411-420. [3.6]

Tyler, David E. (1983b), The asymptotic distribution of principal component roots
under local alternatives to multiple roots, Annals of Statistics, 11, 1232-1242.
[11.7]

Tyler, David E. (1987), A distribution free M-estimator of multivariate scatter, Annals
of Statistics, 15, 234-251. [3.6]

Velu, R. P., G. C. Reinsel, and D. W. Wichern (1986), Reduced rank models for
multiple time series, Biometrika, 73, 105-118. [12.7]

von Mises, R. (1945), On the classification of observation data into distinct groups,
Annals of Mathematical Statistics, 16, 68-73. [6.8]

von Neumann, J. (1937), Some matrix-inequalities and metrization of matric-space,
Tomsk University Review, 1, 286-300. Reprinted in John von Neumann Collected
Works (A. H. Taub, ed.), 4 (1962), Pergamon, New York, 205-21^. [A.4]

Wald, A. (1943), Tests of statistical hypotheses concerning several parameters when
the number of observations is large, Transactions of the American Mathematical
Society, 54, 426-482. [4.2]

Wald, A. (1944), On a statistical problem arising in the classification of an individual
into one of two groups, Annals of Mathematical Statistics, 15, 145-162. [6.4, 6.5]

Wald, A. (1950), Statistical Decision Functions, John Wiley & Sons, New York. [6.2,
6.7, 8.10]



Wald, A., and R. Brookner (1941), On the distribution of Wilks’ statistic for testing
the independence of several groups of variates, Annals of Mathematical Statistics,
12, 137-152. [8.4, 9.3]

Walker, Helen M. (1931), Studies in the History of Statistical Method, Williams and
Wilkins, Baltimore. [1.1]

Welch, P. D., and R. S. Wimpress (1961), Two multivariate statistical computer
programs and their application to the vowel recognition problem, Journal of the
Acoustical Society of America, 33, 426-434. [6.10]

Wermuth, N. (1980), Linear recursive equations, covariance selection and path
analysis, Journal of the American Statistical Association, 75, 963-972. [15.5]

Whittaker, E. T., and G. N. Watson (1943), A Course of Modern Analysis, Cambridge
University, Cambridge. [8.5]

Whittaker, Joe (1990), Graphical Models in Applied Multivariate Statistics, John Wiley
& Sons, Inc., Chichester. [15.1]

Wijsman, Robert A. (1979), Constructing all smallest simultaneous confidence sets in
a given class, with applications to MANOVA, Annals of Statistics, 7, 1003-1018.
[8.7]

Wijsman, Robert A. (1980), Smallest simultaneous confidence sets with applications
in multivariate analysis, Multivariate Analysis, V, 483-498. [8.7]

Wilkinson, James Hardy (1965), The Algebraic Eigenvalue Problem, Clarendon, Oxford.
[11.4]

Wilkinson, J. H., and C. Reinsch (1971), Linear Algebra, Springer-Verlag, New York.
[11.4]

Wilks, S. S. (1932), Certain generalizations in the analysis of variance, Biometrika, 24,
471-494. [7' 8,3, ltuf 

Wilks, S. S. (1934), Moment-generating operators for determinants of product moments
in samples from a normal system, Annals of Mathematics, 35, 312-340.
[8.3]

Wilks, S. S. (1935), On the independence of k sets of normally distributed statistical
variables, Econometrica, 3, 309-326. [8.4, 9.3, 9.P]

Wishart, John (1928), The generalised product moment distribution in samples from a
normal multivariate population, Biometrika, 20A, 32-52. [7.2]

Wishart, John (1948), Proofs of the distribution law of the second order moment
statistics, Biometrika, 35, 55-57. [7.2]

Wishart, John, and M. S. Bartlett (1933), The generalised product moment distribution
in a normal system, Proceedings of the Cambridge Philosophical Society, 29,
260-270. [7.2]

Wold, H. O. A. (1954), Causality and econometrics, Econometrica, 22, 162-177. [15.1]

Wold, H. O. A. (1960), A generalization of causal chain models, Econometrica, 28,
443-463. [15.1]

Woltz, W. G., W. A. Reid, and W. E. Colwell (1948), Sugar and nicotine in cured
bright tobacco as related to mineral element composition, Proceedings of the Soil
Sciences Society of America, 13, 385-387. [8.P]

Wright, Sewall (1921), Correlation and causation, Journal of Agricultural Research, 20,
557-585. [15.1]



Wright, Sewall (1934), The method of path coefficients, Annals of Mathematical
Statistics, 5, 161-215. [15.1]

Yamauti, Ziro (1977), Concise Statistical Tables, Japanese Standards Association.
[Preface]

Yates, F., and W. G. Cochran (1938), The analysis of groups of experiments, Journal
of Agricultural Science, 28, 556. [8.9]

Yule, G. U. (1897a), On the significance of Bravais’ formulae for regression, &c., in
the case of skew correlation, Proceedings of the Royal Society of London, 60,
477-489. [2.5]

Yule, G. U. (1897b), On the theory of correlation, Journal of the Royal Statistical
Society, 60, 812-854. [2.5]

Zehna, P. W. (1966), Invariance of maximum likelihood estimators, Annals of Mathematical
Statistics, 37, 744. [3.2]




Index 


Absolutely continuous distribution, 8 
Additivity of Wishart matrices, 259 
Admissible, definition of, 88, 210, 235 
Admissible procedures, 235 
Admissible test, definition of, 193 
Stein's theorem for, 194 
Almost invariant test, 192 
Analysis of variance, random effects model, 429

likelihood ratio test of, 431 
See also Multivariate analysis of variance 
Anderson, Edgar, Iris data of, 111
A posteriori density, 89 
of 8() 
of μ and Σ, 274
of μ, given x̄ and S, 275
of Σ, 273

Asymptotic distribution of a function, 132
Asymptotic expansions of distributions of
likelihood ratio criteria, 321
of gamma function, 318

Barley yields in two years, 349 
Bartlett decomposition, 257 
Bartlett-Nanda-Pillai trace criterion, see 
Linear hypothesis 

Bayes estimator of covariance matrix, 273 
Bayes estimator of mean vector, 90 
Bayes procedure, 89 
extended, 90 
Bayes risk, 89 
Bernoulli polynomials, 318 
Best estimator of covariance matrix
invariant with respect to triangular linear
transformations, 279, 281
proportional to sample covariance matrix, 277


Best linear predictor, 37, 497
and predictand, 497
See also Canonical correlations and variates
Best linear unbiased estimator, 298
Beta distribution, 177
Bhattacharya's estimator of the mean, 99
Bivariate normal density, 21, 35
distribution, 21, 35
computation of, 23
Bootstrap method, 135

t 17 

Canonical analysis of regression coefficients, 508
sample, 510

Canonical correlations and variates, 487, W
asymptotic distribution of sample
correlations, 505
computation of, 501
distribution of sample, 545
invariance of, 496
maximum likelihood estimators of, 501
sample, 500
testing number of nonzero correlations, 504
use in testing hypotheses of rank of 
covariance matrix, 504 
Causal chain, 605 

Central limit theorem, multivariate, 86
example, 240
normal, 34

Conditional probability, 12
canonical, see Canonical correlations and
variates 

confidence interval for, 128
by use of Fisher's z, 135
distribution of sample, asymptotic, 133
geometric interpretation of sample, 72
power of, 128
Covariance matrix, 17
Bayes estimator of, 273
characterization of distribution of sample,
77
442

consistency of sample as estimator of
distribution of sample, 255
estimation, see Best estimator of 
geometrical interpretation of sample, 72 
with linear structure, 113 
maximum likelihood estimator of, 70
computation of, 70
when the mean vector is known, 112
of normal distribution, 20
singular, 31
multivariate normal, 20
symmetric matrix, 642
Characteristic roots; Correlation
Generalized variance; Mean vector;
Multivariate normal density; Multivariate
t-distribution; Noncentral chi-squared
Distribution of matrix of sums of squares and
cross-products, see Wishart distribution
Efficiency of vector estimate, definition of, 85
Ellipsoid of constant density, 32
Elliptically contoured distribution, 47
characteristic function of, 53
distribution of, 482, 564
correlation coefficient, asymptotic
distribution of, 159
covariance of, 50
covariance of sample covariance, 101 
covariance of sample mean, 101 
cumulants of, 54
distribution of, 451

likelihood ratio criterion for independence
Elliptically contoured matrix (Continued)
likelihood ratio criterion for independence
of sets, distribution of, 408
likelihood ratio criterion for linear
hypotheses, distribution of, 373
rectangular coordinates, distribution of, 285
stochastic representation, 160, 285
sufficient statistics, 160
T^2-distribution of, 200
Equiangular line, 72
Expected value of complex-valued function,
41

Exponential family of distributions, 194
Factor analysis, 569
centroid method, 586
EM algorithm, 580, 593
exploratory, 574
general factor, 570
model, 570
orthogonal factors, 571
simple structure, 573
transformations, 588
tests of fit for, 581
varimax criterion, modified, 589
Factorization theorem, 83 
Fisher's z, 133 

asymptotic distribution of, 134 
moments of, 134 

See also Correlation coefficient; Partial
Gamma function, multivariate, 257


Generalized analysis of variance, see
Multivariate analysis of variance;
Regression coefficients and function
Generalized T^2, see T^2-test and statistic
Generalized variance, 264
asymptotic distribution of sample, 270
distribution of sample, 268
geometric interpretation of sample in N
dimensions, 267
in p dimensions, 268
invariance of, 465 
moments of sample, 269 
sample, 265 

General linear hypothesis, see Linear 
hypothesis, testing of; Regression 
coefficients and function
Gram-Schmidt orthogonalization, 252, 647
Graphical models, 595
adjacent, nonadjacent vertices, 596
AMP (Andersson-Madigan-Perlman)
Markov chain, 612
ancestor, 605
boundary, 598
chain graph, 630
child, 605
clique, 602 
closure, 598 
complete, 602 
decomposition, 603 
descendant, nondescendant, 605 
edges, 595 

directed, undirected, 596 
LWF (Lauritzen-Wermuth-Frydenberg)
Markov chain, 610 
Markov properties, 597 
globally, 600 
locally, 598 
pairwise, 597 
moral graph, 608 
nodes, 595
parent, 605
partial ordering, 605 
path, 600 

recursive factorization, 609 
separate, 600 
vertices, 595 
well-numbered, 607 

Haar invariant distribution of orthogonal 
matrices, 162, 541 
conditional distribution, 542 
Hadamard’s inequality, 61
Head lengths and breadths of brothers, 109
Hotelling’s T^2, see T^2-test and statistic
Hypergeometric function, 126
Incomplete beta function, 329
Independence, 10 
mutual, 11 

of normal variables, 26 
of sample mean vector and sample 
covariance matrix, 77 

tests of, see Correlation coefficient; Multiple 
correlation coefficient; Testing 
independence of sets of variates 
Information matrix, 85 

Integral of a symmetric unimodal function over
a symmetric convex set, 365
Intraclass correlation, 484
Invariance, see Classification into normal
populations; Correlation coefficient;
Generalized variance; Linear hypothesis;
Multiple correlation coefficient; Partial
correlation coefficient; T^2-test; Testing
that a covariance matrix is a given matrix;
Testing that a covariance matrix is
proportional to a given matrix; Testing
equality of covariance matrices; Testing
equality of covariance matrices and mean
vectors; Testing independence of sets of
variates

Inverted Wishart distribution, 272
Iris, four measurements on, 110, 180 

Jacobian, 13 

James-Stein estimator, 91
for arbitrary known covariance matrix, 97 
average mean squared error of, 95 

Kronecker delta, 75 
Kronecker product of matrices, 643 
characteristic roots of, 643 
determinant of, 643 
Kurtosis, 54
estimation of, 103 

Latin square, 377 

Lawley-Hotelling trace criterion, see Linear
hypothesis 

Least squares estimator, 295 
Likelihood, induced, 71 
Likelihood function for sample from 
multivariate normal distribution, 67 
Likelihood loss function for covariance matrix, 
276 

Likelihood ratio test, definition of, 129. See
also Correlation coefficient; Linear
hypothesis; Mean vector; Multiple
correlation coefficient; Regression
coefficients; T^2-test; Testing that a
covariance matrix is a given matrix; Testing
that a covariance matrix is proportional
to a given matrix; Testing that a covariance
matrix and mean vector are equal to a
given matrix and vector; Testing equality
of covariance matrices; Testing equality
of covariance matrices and mean vectors;
Testing independence of sets of variates
Linear combinations of normal variables, 
distribution of, 29 
Linear equations, solution of, 606 
by Gaussian elimination, 607 
Linear functional relationship, 513
relation to simultaneous equations, 520 
Linear hypothesis, testing of 
admissibility of, 353 
necessary condition for, 363
Bartlett-Nanda-Pillai trace criterion, 331
admissibility of, 379
asymptotic expansion of distribution
of, 333

as Bayes procedure, 378
table of significance points of, 673
tabulation of power of, 333
canonical form of, 303
comparison of powers, 334
invariance of criteria, 327
Lawley-Hotelling trace criterion, 328 
admissibility of, 379 
asymptotic expansion of distribution 
of, 330 

monotonicity of power function of, 368 
table of significance points of, 657 
tabulation of, 328
likelihood ratio criterion, 300
admissibility of, 378 
asymptotic expansion of distribution 
of, 321 

as Bayes procedure, 378 
distributions of, 306, 310
F approximation to distribution of, 326
geometric interpretation of, 302 
moments of, 309 

monotonicity of power function of, 368 
normal approximation to distribution 
of, 323 

table of significance points, 651 
tabulation of distribution of, 314 
Wilks' Λ, 300

monotonicity of power function of an
invariant test, 363
Roy’s maximum root criterion, 333
distribution for p = 2, 334
monotonicity of power function of, 368
Linear hypothesis (Continued)

table of significance points, 677
tabulation of distribution of, 333
step-down test, 314

See also Regression coefficients and function 
Linearly independent vectors, 627 
Linear transformation of a normal vector, 23,
29, 31
Loss, 88

LR decomposition, 630

Mahalanobis distance, 80, 217
sample, 228
Majorization, 355
weak, 355 
Marginal density, 9 
distribution, 9
normal, 27

Mathematical expectation, 9 
Matrix, 624 
bidiagonal, upper, 503 
characteristic roots and vectors of, see 
Characteristic roots and vectors 
cofactor in, C 
convexity, 358 
definition of, 624 
diagonalization of symmetric, 631
doubly stochastic, 646
eigenvalue, see Characteristic roots and
vectors

Givens, 471, 649
Householder, 470, 650
idempotent, 635
identity, 626 
inverse, 627 
minor of, 627 
nonsingular, 627 
operations with, 625 
positive definite, 628
positive semidefinite, 628
rank of, 628 
symmetric, 626 
trace of, 629 
transpose, 625 
triangular, 629 
tridiagonal, 470 

Matrix of sums of squares and cross-products
of deviations from the means, 68
Maximum likelihood estimators, see Canonical
correlations and variates; Correlation
coefficient; Covariance matrix; Mean
vector; Multiple correlation coefficient;
Partial correlation coefficient; Principal
components; Regression coefficients;
Variance


Maximum likelihood estimator of function of 
parameters, 71 

Maximum of the likelihood function, 70 
Maximum of variance of linear combinations, 
464. See also Principal components 
Mean vector, 17 

asymptotic normality of sample, 86 
completeness of sample as an estimator of
population, 84

confidence region for difference of two 
when common covariance matrix is 
known, 80 

when covariance matrix is unknown, 180

consistency of sample as estimate of 
population, 86 
distribution of sample, 76 
efficiency of sample, 85 
improved estimator when covariance matrix
is unknown, 185 

maximum likelihood estimator of, 70 
sample, 67 

simultaneous confidence regions for linear 
functions of, 178 

testing equality of, in several distributions, 
206 

testing equality of two when common 
covariance matrix is known, 80
tests of hypothesis about 
when covariance matrix is known, 80 
when covariance matrix is unknown, see
T²-test

See also James-Stein estimator 
Minimax, 90

Missing observations, maximum likelihood 
estimators, 168
Modulus, 13 
Moments, 9, 41 
factoring of, 11 

from marginal distributions, 10
of normal distributions, 46
Monotone region, 355
in majorization, 355 
Multiple correlation coefficient 
adjusted, 153 
distribution of sample
conditional, 154

when population correlation is not 
zero, 156

when population correlation is zero, 150

geometric interpretation of sample, 148 
invariance of population, 60
invariance of sample, 166 
likelihood ratio test that it is zero, 151
as maximum correlation between one
variable and linear combination of other 
variables, 38 

maximum likelihood estimator of, 147 
moments of sample, 156 
optimal properties of, 157 
population, 38 
sample, 145 

tabulation of distribution of, 1/7 
Multivariate analysis of variance (MANOVA),
346

Latin square, 377 
one-way, 342 
two-way, 346 

See also Linear hypothesis, testing of
Multivariate beta distribution, 377 
Multivariate gamma function, 257
Multivariate normal density, 20 
distribution, 20 
computation of, 23 
Multivariate t-distribution, 276, 289

20 

20 

Neyman-Pearson fundamental lemma, 248 
Noncentral chi-squared distribution, 82 
Noncentral F-distribution, 186 
tables of, 186 

Noncentral T²-distribution, 186

O(N × p), 161
Orthonormal vectors, 647 

Parallelotope, 266 
volume of, 266 
Partial correlation coefficient 
computational formulas for, 39, 40, 41 
confidence intervals for, 143 
distribution of sample, 143 
geometric interpretation of sample, 138 
invariance of population, 63 
invariance of sample, 166 
maximum likelihood estimator of, 138 
in the population, 35 
recursion formula for, 41 
sample, 138 
tests about, 144 
Partial covariance, 34 
estimator of, 137 
Partial variance, 34 
Partitioning of a matrix, 635
addition of, 635 
of a covariance matrix, 25 
determinant of, 637


inverse of, 638 
multiplication of, 635 
Partitioning of a vector, 635
of a mean vector, 25 
of a random vector, 24 

Path analysis, 596. See also Graphical models
Pearson correlation coefficient, see 
Correlation coefficient 
Plane of closest fit, 466 
Polar coordinates, 285
Positive definite matrix, 628 
Positive part of a function, 96 
of the James-Stein estimator, 97
Positive semidefinite matrix, 628
Precision matrix, 272 
unbiased estimator of, 274
Principal axes of ellipsoids of constant density, 
465. See also Principal components
Principal components, 459
asymptotic distribution of sample, 473 
computation of, 469 
confidence region for, 475, 477
distribution of sample, 540, 542 
maximum likelihood estimator of, 467 
population, 464 

testing hypotheses about, 478, 479, 480
Probability element, 8 
Product-moment correlation coefficient, see 
Correlation coefficient 

QL algorithm, 471

QR algorithm, 471 
decomposition, 647 
Quadratic form, 628 

Quadratic loss function for covariance matrix, 
276 

ℜ (real part), 257
Random matrix, 16
expected value of, 17 
Random vector, 16 
Randomized test, definition of, 192
Rectangular coordinates, 257 
distribution of, 255, 257 
Reduced rank regression, 514 
estimator, asymptotic distribution of, 550

Regression coefficients and function, 34
confidence regions for, 339 
distribution of sample, 297 
geometric interpretation of sample, 138 
maximum likelihood estimator of, 294 
partial correlation, connection with, 61
simple, 204
Regression coefficients (Continued)
simultaneous confidence intervals for, 340, 
341 

testing hypotheses of rank of, 512 
testing they are zero, in case of one 
dependent variable, 152 
Residuals from regression, 37 
Risk function, 88 

Selection of linear combinations, 201 
Simple correlation coefficient, see Correlation
coefficient 

Simultaneous equations, 513 
estimation of coefficients, 518 
least variance ratio (LVR), 519
limited information maximum likelihood
(LIML), 519

two stage least squares (TSLS), 522 
identification by zeros, 516 
reduced form, 516 
estimation of, 517 
Singular normal distribution, 30, 31 
Singular value decomposition, 498, 634 
Spherical distribution, 105 
left, 105 
right, 105 
vector, 105 

Spherical normal distribution, 23 
Spherically contoured distribution, 47
stochastic representation, 49 
uniform distribution, 48 
Sphericity test, see Testing that a covariance 
matrix is proportional to a given matrix 
Standardized sum statistics, 201 
Standardized variable, 22 
Stiefel manifold, 162
Stochastic convergence, 113 
of a sequence of random matrices, 113 
Sufficiency, definition of, 83 
Sufficiency of sample mean vector and
covariance matrix, 83 
Surface area of unit sphere, 286 
Surfaces of constant density, 22
Symmetric matrix, 626 

T²-statistic, 176. See also T²-test and statistic
T²-test and statistic, 173
admissibility of, 196 
as Bayes procedure, 199 
distribution of statistic, 176 
geometric interpretation of statistic, 174 
invariance of, 173 

as likelihood ratio test of mean vector, 176
limiting distribution of, 176 
noncentral distribution of statistic, 186 


optimal properties of, 190 
power of, 186 
tables of, 186 

for testing equality of means when 
covariance matrices are different, 187 
for testing equality of two mean vectors 
when covariance matrix is unknown, 179
for testing symmetry in mean vector, 182 
as uniformly most powerful invariant test of 
mean vector, 191 

Testing that a covariance matrix is a given 
matrix, 438 
invariant tests of, 442 
likelihood ratio criterion for, 438 
modified likelihood ratio criterion for, 438
asymptotic expansion of distribution of,
442

moments of, 440 
table of significance points, 685 
Nagao's criterion, 442 
Testing that a covariance matrix is 

proportional to a given matrix, 431 
invariant tests, 436 
likelihood ratio criterion for, 434
admissibility, 458 

asymptotic expansion of distribution of, 
435 

moments of, 434 
table of significance points, 682 
Nagao’s criterion, 437 

Testing that a covariance matrix and mean 
vector are equal to a given matrix and 
vector, 444 

likelihood ratio criterion for, 444
asymptotic expansion of distribution of, 
446 

moments of, 445 

Testing equality of covariance matrices, 412 
invariant tests, 428
likelihood ratio criterion for, 413
invariance of, 414 

modified likelihood ratio criterion for, 413
admissibility of, 449 
asymptotic expansion of distribution of, 425

distribution of, 420 
moments of, 422 
table of significance points, 681 
Nagao’s criterion for, 415 
Testing equality of covariance matrices and 
mean vectors, 415
likelihood ratio criterion for, 415
asymptotic expansion of distribution of, 426

distribution of, 421 
moments of, 422
unbiasedness of, 416 
Testing independence of sets of variates, 381

and canonical correlations, 504 
likelihood ratio criterion for, 384
admissibility of, 401 

asymptotic expansion of distribution of, 
390 

distribution of, 388 
invariance of, 386 
moments of, 388 

monotonicity of power function of, 404 
unbiasedness of, 386 
Nagao’s test, 392 

asymptotic expansion of distribution of, 
392 

step-down tests, 393

Testing rank of regression matrices, 512 
Tests of hypotheses, see Correlation 
coefficient; Generalized analysis of 
variance; Linear hypothesis; Mean vector;
Multiple correlation coefficient; Partial 
correlation coefficient; Regression 
coefficients; T²-test and statistic
Tetrachoric functions, 23 


Total correlation coefficient, see Correlation
coefficient 

Trace of a matrix, 629
Transformation of variables, 12 

Unbiased estimator, definition of, 77 
Unbiased test, definition of, 364
Uniform distribution on unit sphere, 48 
on O(N × p), 162

Variance, 17 

generalized, see Generalized variance 
maximum likelihood estimator of, 71 

w(A | Σ, n), 255
W(Σ, n), 255

272 

272 

Wishart distribution, 256
characteristic function of, 258 
geometric interpretation of, 256 
marginal distributions of, 260
noncentral, 587 
for p = 2, 124

z, see Fisher's z
Zonal polynomials, 473 



